% Copyright (C) 2014-2020 by Thomas Auzinger <thomas@auzinger.name>

\documentclass[draft,final]{vutinfth} % Remove option 'final' to obtain debug information.

% Load packages to allow in- and output of non-ASCII characters.
\usepackage{lmodern}        % Use an extension of the original Computer Modern font to minimize the use of bitmapped letters.
\usepackage[T1]{fontenc}    % Determines font encoding of the output. Font packages have to be included before this line.
\usepackage[utf8]{inputenc} % Determines encoding of the input. All input files have to use UTF8 encoding.

% Extended LaTeX functionality is enables by including packages with \usepackage{...}.
\usepackage{amsmath}    % Extended typesetting of mathematical expression.
\usepackage{amssymb}    % Provides a multitude of mathematical symbols.
\usepackage{mathtools}  % Further extensions of mathematical typesetting.
\usepackage{microtype}  % Small-scale typographic enhancements.
\usepackage[inline]{enumitem} % User control over the layout of lists (itemize, enumerate, description).
\usepackage{multirow}   % Allows table elements to span several rows.
\usepackage{booktabs}   % Improves the typesettings of tables.
\usepackage{subcaption} % Allows the use of subfigures and enables their referencing.
\usepackage[ruled,linesnumbered,algochapter]{algorithm2e} % Enables the writing of pseudo code.
\usepackage[usenames,dvipsnames,table]{xcolor} % Allows the definition and use of colors. This package has to be included before tikz.
\usepackage{nag}       % Issues warnings when best practices in writing LaTeX documents are violated.
\usepackage{todonotes} % Provides tooltip-like todo notes.
\usepackage{hyperref}  % Enables cross linking in the electronic document version. This package has to be included second to last.
\usepackage[acronym,toc]{glossaries} % Enables the generation of glossaries and lists fo acronyms. This package has to be included last.
\usepackage{float}     % Control positioning more fine-grained
\usepackage{listings}
\usepackage{minted}
\usepackage{xurl}
\usepackage{placeins}
\usemintedstyle{manni}

% Define convenience functions to use the author name and the thesis title in the PDF document properties.
\newcommand{\authorname}{Norbert Cornelius Tremurici} % The author name without titles.
\newcommand{\thesistitle}{Exploring extensions for a basic RISC-V pipeline implementation to support the execution of an operating system} % The title of the thesis. The English version should be used, if it exists.

% color macros
\newcommand{\catSN}[0]{\textcolor{BrickRed}{SN} }
\newcommand{\catEX}[0]{\textcolor{OliveGreen}{E} }
\newcommand{\catPE}[0]{\textcolor{RoyalBlue}{PE} }
\newcommand{\catCO}[0]{\textcolor{MidnightBlue}{C} }

% Set PDF document properties
\hypersetup{
    pdfpagelayout   = TwoPageRight,           % How the document is shown in PDF viewers (optional).
    linkbordercolor = {Melon},                % The color of the borders of boxes around crosslinks (optional).
    pdfauthor       = {\authorname},          % The author's name in the document properties (optional).
    pdftitle        = {\thesistitle},         % The document's title in the document properties (optional).
    pdfsubject      = {Subject},              % The document's subject in the document properties (optional).
    pdfkeywords     = {a, list, of, keywords} % The document's keywords in the document properties (optional).
}

\setpnumwidth{2.5em}        % Avoid overfull hboxes in the table of contents (see memoir manual).
\setsecnumdepth{subsection} % Enumerate subsections.

\nonzeroparskip             % Create space between paragraphs (optional).
\setlength{\parindent}{0pt} % Remove paragraph identation (optional).

\makeindex      % Use an optional index.
\makeglossaries % Use an optional glossary.
%\glstocfalse   % Remove the glossaries from the table of contents.

% Set persons with 4 arguments:
%  {title before name}{name}{title after name}{gender}
%  where both titles are optional (i.e. can be given as empty brackets {}).
\setauthor{}{\authorname}{}{male}
\setadvisor{Ao.Univ.Prof. Dipl.-Ing. Dr.techn.}{Andreas Steininger}{}{male}

% For bachelor and master theses:
\setfirstassistant{Projektass. Dipl.-Inform.}{Florian Kriebel}{}{male}
%\setsecondassistant{Pretitle}{Forename Surname}{Posttitle}{male}
%\setthirdassistant{Pretitle}{Forename Surname}{Posttitle}{male}

% For dissertations:
%\setfirstreviewer{Pretitle}{Forename Surname}{Posttitle}{male}
%\setsecondreviewer{Pretitle}{Forename Surname}{Posttitle}{male}

% For dissertations at the PhD School and optionally for dissertations:
%\setsecondadvisor{Pretitle}{Forename Surname}{Posttitle}{male} % Comment to remove.

% Required data.
\setregnumber{11907086}
\setdate{16}{08}{2023} % Set date with 3 arguments: {day}{month}{year}.
\settitle{\thesistitle}{Exploring extensions for a basic RISC-V pipeline implementation to support the execution of an operating system} % Sets English and German version of the title (both can be English or German). If your title contains commas, enclose it with additional curvy brackets (i.e., {{your title}}) or define it as a macro as done with \thesistitle.
%\setsubtitle{Optional Subtitle of the Thesis}{Optionaler Untertitel der Arbeit} % Sets English and German version of the subtitle (both can be English or German).

% Select the thesis type: bachelor / master / doctor / phd-school.
% Bachelor:
\setthesis{bachelor}
%
% Master:
%\setthesis{master}
%\setmasterdegree{dipl.} % dipl. / rer.nat. / rer.soc.oec. / master
%
% Doctor:
%\setthesis{doctor}
%\setdoctordegree{rer.soc.oec.}% rer.nat. / techn. / rer.soc.oec.
%
% Doctor at the PhD School
%\setthesis{phd-school} % Deactivate non-English title pages (see below)

% For bachelor and master:
\setcurriculum{Computer Engineering}{Technische Informatik} % Sets the English and German name of the curriculum.

% For dissertations at the PhD School:
\setfirstreviewerdata{Affiliation, Country}
\setsecondreviewerdata{Affiliation, Country}

\begin{document}

\frontmatter % Switches to roman numbering.
% The structure of the thesis has to conform to the guidelines at
%  https://informatics.tuwien.ac.at/study-services

\addtitlepage{naustrian} % German title page (not for dissertations at the PhD School).
\addtitlepage{english} % English title page.
\addstatementpage

%\begin{danksagung*}
%\todo{Ihr Text hier.}
%\end{danksagung*}

%\begin{acknowledgements*}
%\todo{Enter your text here.}
%\end{acknowledgements*}

\begin{kurzfassung}
    
In den letzten Jahren hat der open-source RISC-V Befehlssatz (ISA) und das dazugehörige Ökosystem an Sichtbarkeit in den industriellen und akademischen Sektoren gewonnen.
Es wird auch breitflächig genutzt um Grundlagen der Technischen Informatik zu lehren, vorallem in einem Befehlssatzdesign- und Pipeliningkontext.
Als solches spielt Praxiserfahrung (wie etwa eine Implementierung der Pipeline eines Prozessorkerns) eine entscheidende Rolle in der Erforschung der mit Pipelining verbundenen Probleme.
Während eine grundlegende Pipelineimplementierung es erlaubt, Bare-Metal-Applikationen mit vertretbarem Implementierungs- und Infrastrukturaufwand auszuführen, bleibt das Erweitern eines solchen Systems um die Ausführung komplexerer Applikationen oder gar Betriebssysteme zu erlauben (etwa durch eine komplexere Speicherhierarchie oder fortgeschrittener I/O Möglichkeiten) eine unzumutbare Aufgabe für eine zum Bachelor gehörige Rechnerstrukturlehrveranstaltung.

Obwohl eine breite Vielfalt an kommerziellen und akademischen RISC-V Implementierungen und Projekten existiert, welche zum Teil aufgezählte Funktionen unterstützen, sind diese oft entweder zu kompliziert, erfordern einen zusätzlichen Implementierungsaufwand für mehrere Erweiterungen, oder erfüllen nicht die Anforderungen und Vorgaben für das Interface, die Infrastruktur, die Hardwareplattform oder die Hardwarebeschreibungssprache.
Daher ist das Ziel dieser Arbeit eine potenzielle, minimale und/oder realisierbare Menge an notwendigen Erweiterungen für eine existierende, einfache 5-stufige RISC-V Pipelineimplementierung zu finden um die Ausführung eines Betriebssystems zu unterstützen.
Impliziert ist das Auffinden verfügbarer Betriebssyteme für diese Implementierung, ein Vergleich hinsichtlich Funktionalität sowie derer Vor- und Nachteile, eine Analyse der Trade-offs, eine Abschätzung des dazugehörigen Entwicklungs- und Implementierungsaufwandes hinsichtlich der notwendigen Hardware- und Softwareerweiterungen und zu guter Letzt, eine passende Wahl.

Als Ergebnis dieser Untersuchung wird eine Analyse des aktuellen Zustandes von RISC-V Betriebssytemen bereitgestellt, so wie Prototypimplementierungen einer \textit{Exception}, \textit{Interrupt} und \textit{Trap} Architektur, RISC-V M und Zicsr Erweiterungen und ein einfacher 2-bit Branch Predictor.
Diese Erweiterungen wurden entweder als notwendige oder leistungssteigendere Komponenten für die Unterstützung eines ausgewählten Betriebssystems, FreeRTOS, aufgesetzt in einer einfachen Konfiguration um eine 2-thread Applikation auszuführen, identifiziert.

\end{kurzfassung}

\begin{abstract}

In recent years, the open-source RISC-V Instruction Set Architecture and the corresponding ecosystem gained increasing visibility in the industry and academic sectors.
It is also widely used for teaching computer architecture fundamentals, especially in the context of ISA design and pipelining.
In this regard, hands-on experience (i.e., implementing the core processor pipeline) plays an important role to explore and understand the problems associated with pipelining.
While such a basic pipeline implementation allows running bare metal applications with a manageable implementation and infrastructure overhead, extending this basic system with more sophisticated functionality (e.g., being able to execute more complex applications or even an operating system), a more complex memory hierarchy or advanced I/O is typically not feasible for an undergraduate computer architecture course.

Although a wide variety of commercial and academic RISC-V implementations and projects offering (parts of) the above-mentioned features exist, they often are either too complex, require several extensions to be implemented, or do not meet interface, infrastructure, hardware platform or hardware description language requirements.
Therefore, the goal of this work is to explore the potential and a minimal and/or feasible set of required extensions for an existing basic 5-stage RISC-V pipeline implementation to be able to support the execution of an operating system.
This implies discovering existing operating system options for this implementation, comparing them with respect to their features and advantages/disadvantages, analyzing trade-offs and estimating the corresponding development and implementation effort for required hardware and software extensions and finally selecting an appropriate option.

As a result of this exploration, an analysis of operating systems supporting RISC-V is provided, as well as prototype implementations of an exception, interrupt and trap architecture, RISC-V M and Zicsr extensions and a simple 2-bit branch predictor.
These extensions were identified either as necessary or performance enhancing components to support the execution of an operating system of choice, FreeRTOS, set up in a simple configuration to run a two-threaded application.

\end{abstract}

% Select the language of the thesis, e.g., english or naustrian.
\selectlanguage{english}

% Add a table of contents (toc).
\tableofcontents % Starred version, i.e., \tableofcontents*, removes the self-entry.

% Switch to arabic numbering and start the enumeration of chapters in the table of content.
\mainmatter

% CHAPTER
\chapter{Introduction}
\label{chapter:introduction}

\section{Motivation}

The motivation of this bachelor thesis is to explore the RISC-V hardware and software ecosystem, such that a clear and concise picture of available software and its actual hardware requirements emerges.
This undertaking is intended to enhance an existing relatively basic pipelined RISC-V processor implementation supporting the 32-bit base integer instruction set (RV32I) by studying available implementation paths and providing required new functionality.
Students of the undergraduate level Computer Engineering study program take this task on to complete a digital design and computer architecture lab course, with teams of two students each producing their own implementation of the so-called MiRiV processor.
MiRiV serves to show students the basics of pipelining and its related concepts.
Usually students end the course with an actual and functional pipelined RV32I implementation.

The end result, already large in scope for a lab course, is however still limited in its support for the execution of more complex applications.
Students get to implement a basic processor almost from the ground up, being provided an environment in which to develop their designs and test their implementation.
But going beyond a bare-metal execution of simple programs remains a difficult challenge.
Real world computers and microcontrollers consist of more than the bare-metal execution of simple programs.
Thus, there exists a great opportunity to extend this existing environment such that students can see their own implementation perform tasks more reminiscent of actual workloads found in the real world.

With the support of a real-time operating system, FreeRTOS, this desired situation has been realized.

\section{Outline}

In Section~\ref{section:riscv-manual}, we will take a quick look at the official RISC-V specifications.
The extent of the MiRiV processor in its entirety, together with a description of the platform (target devices and software environment) it runs on is discussed in Section~\ref{section:miriv}.
Running operating systems on RISC-V architectures will be discussed in Section~\ref{section:support-exec-os}.
Considerations about whether to use existing RISC-V processor implementations with openly available source code will be laid out in Section~\ref{section:utilizing-existing-implementations} while available options will be explored in \ref{section:some-existing-implementations}.

Chapter~\ref{chapter:overview} will explore RISC-V standard ISA extensions in Section~\ref{section:overview-ext} and several available operating systems in Section~\ref{section:overview-os}.
In this second chapter the state of the current software ecosystem is examined and on this basis an implementation path is chosen.
We will consider real-time operating systems and embedded Linux based operating systems in more detail.
The findings will be concluded in Section~\ref{section:overview-conclusion}.
%We will consider real-time operating systems in more detail in Section~\ref{section:overview-rtos} and Linux based operating systems in \ref{section:overview-linux}.

Afterwards, in Chapter~\ref{chapter:architecture} a few selected extensions will be illustrated by way of providing architectural considerations and their design for the final implementation.
These include the exception, interrupt and trap architecture (section \ref{section:exc-int-trap-arch}), as well as considerations for the implementation of the M (section \ref{section:m-arch}), A (section \ref{section:a-arch}), Zifencei (section \ref{section:zifencei-arch}) and Zicsr (section \ref{section:zicsr-arch}) extensions and a branch predictor (section \ref{section:bp-arch}).

Finally, implementation details are outlined in Chapter~\ref{chapter:implementation}.
Section~\ref{section:preparing-miriv} will explain modifications made to the basic MiRiV processor implementation, giving a short overview of all architectural changes.
More elaborate implementation details are also provided in Sections \ref{section:exc-int-trap-implementation} (exceptions, interrupts and traps), \ref{section:m-ext-implementation} (M extension) and \ref{section:bp-implementation} (modified branch predictor).
Section~\ref{section:preparing-freertos} will show how FreeRTOS is incorporated into the already existing suite of software tests, what options it offers, how it can be configured and how it is intended to run on the extended MiRiV processor and in Section~\ref{section:freertos-demo} we will discuss the demo application based on FreeRTOS that has been implemented.

The results are reflected on in Chapter~\ref{chapter:conclusion}, where Section~\ref{section:future-work} details possible future trajectories of the extended MiRiV processor and Section~\ref{section:conclusion} gives a summary of this entire effort.

\section{Note on RISC-V Specifications}
\label{section:riscv-manual}

Throughout this document we will refer to both RISC-V specifications that are part of the RISC-V Instruction Set Manual.
Volume 1 is the RISC-V unprivileged specification \cite{riscv-isa-vol-1} and Volume 2 is the RISC-V privileged specification \cite{riscv-isa-vol-2}.
Both try to keep implementation details as undetermined as possible, to support a wide variety of implementations while still ensuring enough similarity between processors implementing the RISC-V specifications where it is necessary, for example to avoid excessive fragmentation.
The RISC-V privileged specification describes different privilege modes for a privileged architecture, to support all kinds of execution environments.
A configuration could include privilege modes for the machine, a hypervisor, a supervisor and the user.
The RISC-V unprivileged specification includes all base instructions, \textit{unprivileged} instructions, that can be used in all privilege modes.
It also contains optional ISA extensions that supply additional hardware functionality.

\section{The MiRiV Processor}
\label{section:miriv}

MiRiV is a minimal RISC-V implementation that covers the RV32I base integer instruction set.
We will now look at the design of its pipeline, as well as the hardware and software environments it runs in.

\subsection{Pipeline Design}

The implementation is pipelined and the pipeline design is the basic and canonical RISC-V pipeline as outlined in the definitive book by David A. Patterson and John L. Hennessy \cite{comporg2020}.
The five stages of this pipeline are the instruction fetch stage, the instruction decode stage, the execute stage, the memory stage and finally the write-back stage.
Figure~\ref{fig:miriv-pipeline} serves to show the individual components of this implementation.

\begin{figure}[h]
    \centering
    \includegraphics[width=\textwidth]{miriv-pipeline.png}
    \caption{The base MiRiV pipeline \cite{ddca2021}. For the final, modified pipeline with all necessary extensions refer to Figure~\ref{fig:miriv-new-pipeline}.}
    \label{fig:miriv-pipeline}
\end{figure}

In addition to the pipeline itself, a control unit that stalls and flushes the pipeline accordingly in order to properly detect and handle hazards is implemented.
Forwarding from the memory and write-back stages is also implemented, for writes that aren't yet reflected by reads from the register file in the decode stage.

Finally, caches are implemented, ranging from a direct-mapped cache up to 2-way or 4-way set associative caches.
Implementing a cache makes the processor more realistic and viable.
For the students it also serves to relax memory assumptions (such as instructions or data being available at the next clock cycle) and illustrate associated challenges.

It is important to also discuss what is \textit{not} part of the processor implementation, to begin to understand its limitations.
There is no interrupt unit and there are no trap handlers to handle exceptions and interrupts.
Exceptions are implemented as signals within respective design units, to capture decode exceptions and load and store exceptions.
But so far they exist only as unused signals that can be probed inside a simulation.

Furthermore, none of the extensions of the RISC-V unprivileged \cite{riscv-isa-vol-1} or privileged \cite{riscv-isa-vol-2} specifications are implemented.
Notably, this means that Zicsr, counters (except for a simple 32-bit memory-mapped counter mentioned in Section~\ref{subsection:software-environment}) and timers aren't implemented, so there are no control and status registers (CSRs) or support for instructions manipulating them.
Other extensions like the multiplication, atomic or floating-point extensions are also currently not supported.

Memory is confined to embedded memory on the FPGA chip and there is no support for virtual memory, as the implementation lacks a memory-management unit (MMU).
Both the instruction and data memories are limited to 16 KiB each.
There is also no support for different privilege modes.

This is fine for a basic pipeline implementation that exists for learning purposes, but as we will see some extensions and changes are required in order to run an operating system.

\subsection{Target Device}

As the purpose of the lab course is the development and implementation of the MiRiV processor, the target devices consist of field-programmable gate arrays (FPGAs).
The hardware description language (HDL) used is VHDL.
Two specific boards are targeted, the Terasic Altera DE2-115 board as well as the Terasic Altera DE0-Nano board (both hosting FPGAs from the Cyclone IV E family of FPGAs).

These FPGA boards provide many on-board peripherals, although not many are used for this processor.
Instruction and data memory are instantiated as embedded memory on the FPGA chip itself.
No peripheral memory components, such as SDRAM for example, are used in the basic MiRiV implementation.

Once a MiRiV design is synthesized and fitted, it is flashed onto the FPGA.
The bootloader consists of script utilities that use the USB serial programmer (on-board Altera USB-Blaster) to program regions of the instruction and data memories and set the state of the processor accordingly such that it starts reading and executing the right instruction.

\subsection{The Software Environment}
\label{subsection:software-environment}

Of course such an undertaking would not be as instructive if there was no testing.
There are simulations of individual units as well as simulations that simulate the entire pipeline as a whole.
But the goal is not only to build a processor that runs in simulations, it's also to build a processor that can execute programs on hardware.

Currently, the only interfacing I/O of the MiRiV processor, once programmed onto the target device, is a USB serial connection to communicate with the host via UART.
This allows for simple textual output that showcases correct execution of the processor.
Using this UART interface is made possible by using a custom, very lightweight implementation of a small C library implementing only 10 functions.
Those functions include string comparison and string length, a time function that returns a tick counter value with a granularity of microseconds, as well as simple functions to read from and write to UART.

This environment already offers great testing capabilities, but any complex tasks, particularly multi-tasking, prove to be a challenge.
Due to such limitations, students are restricted to either writing simple monolithic programs or invest unrealistic efforts into understanding the system well enough to design and implement more comprehensive capabilities.
As previously outlined, because the MiRiV processor is a platform for students to learn, providing a more complete and sophisticated environment where even an operating system is supported would make the entire platform more accessible for those who want to go beyond simple bare-metal applications.

\section{Supporting the Execution of an Operating System}
\label{section:support-exec-os}

Currently there are a few supplied example programs and students can write their own test programs.
But because of the limited support for sophisticated library features these programs are invariably limited in scope.

An operating system is usually built against a C standard library implementing functionality of the C standard for the C programming language, as is a lot of other code.
A motivated student longing to see his freshly implemented processor do actual work will quickly realize that there is a mountain of work to be scaled as they will have to implement all required features of the C standard, port an existing C standard library or write a lot of their own code, all of which would be hard to test and debug.

One of the most inspiring test cases for a processor implementation of a student would arguably be an operating system, which is an application that illustrates well the complexities of running more elaborate software.
Offering them an opportunity to test out an operating system would perhaps even inspire them to join the effort and make further extensions, which constitutes a clear-cut path to increasing the scope of MiRiV as a project.

Any student wondering how they can make the transition from their RV32I implementation to a system that does everything an operating system requires will be faced with similar questions like the ones this work will be concerned with.
So one additional goal is to serve as a reference and fairly comprehensive overview for anyone looking to get into this difficult space.

It is not that difficult to find an unconstrained set of extensions needed to run an operating system.
There are several RISC-V implementations that can run operating systems that list in their specifications what they have implemented, but it is difficult currently to find out what \emph{minimal} set of extensions are necessary for any given operating system.

The question of implementing a minimal set of extensions would be relevant to any interested students because after completing the lab course, they will have implemented for themselves only the 32-bit base integer instruction set.
Targeting the minimal set of extensions does not only constrain the scale of the implementation effort and lead to a working example quicker, but also reduces complexity and thereby makes the task at hand easier.

\section{Utilizing Existing Implementations}
\label{section:utilizing-existing-implementations}

Several RISC-V processor implementations that support the execution of even operating systems like Linux with many of its features already exist \cite{web-riscv-boards-exchange}.
By nature of the RISC-V ecosystem, open-source implementations can even be found with ease and freely used, re-used, even extended or contributed to.
It is fair to ask whether there isn't already an implementation that can serve the purpose outlined in the motivation.
Several factors come into play, such as complexity, extension support and whether they meet all other requirements.
Such requirements include interface, infrastructure, hardware platform or hardware description language (HDL) requirements.

The author is not aware of a single existing RISC-V processor implementation that can readily and easily be adapted to fit for this use case.
For one, if they are too complex, they pose too great a leap from the MiRiV implementation in the hands of students.
Some implement and depend on too many extensions, which in turn raises their complexity.
Common performance features such as out-of-order execution, superscalar design, superpipelined design or sophisticated branch prediction have a considerably higher barrier to entry for students who want to improve on their own humble implementation.

An implementation is usually bound to a hardware platform and makes extensive use of an infrastructure developed as part of the implementation effort.
In adapting an implementation, constraints like supported hardware platforms are inherited and existing infrastructure can make for a challenging integration effort.
One particularly important consideration is the memory organization and by extension, support for memory interfaces.
Even if an implementation turns out to be compatible, they might still be based on different HDLs.

This will more easily enable students to integrate their solution.

There are many ways to modularize, extend and implement units in different HDLs.
Because of this, the issue of packaging a set of extensions so that a student can integrate them effortlessly becomes intricate.
The first goal is to keep a common base with the extended MiRiV implementation.
The second is to use isolated modules and precompiled modules that rely as little as possible on the parts of MiRiV that will be variable due to student implementation variance.
The assignment itself has fixed interfaces, making unknown implementations more predictable.
If some of these factors cannot be completely accounted for, it is also possible to make it the task of students to perform basic implementation tasks in order to make their implementation compatible.

\section{Some Existing Implementations}
\label{section:some-existing-implementations}

We do still want to mention other existing implementations, as it gives a good overview of features that can be implemented.
There is a wealth of open source RISC-V designs and the designs range from basic CPU implementations, entire Systems-on-Chip (SoC) to boards containing peripheral devices.
Some of them are meant to be taped out on actual integrated circuits to be put on actual boards, while others exist purely as soft-core implementations meant for FPGA devices.
Designs may even support both approaches.

One of the earliest proponents of RISC-V were SiFive, who have brought to market development boards meant to offer RISC-V platforms with many capabilities.
Their CPU designs are modular and actual specifications can be tweaked to one's requirements.
The HiFive Unmatched board \cite{web-hifive-unmatched}, which is their current flagship device, supports on-board 16 GB 64-bit DDR4 @ 1866 MT/s, on-board 32 MB QSPI NOR flash, an optional microSD card, an optional solid-state drive via NVMe SSD M.2 Key M for their memory options \cite{web-hifive-unmatched-datasheet}.
Then there is Ethernet (10/100/1000 Mbps) and optional Wi-Fi and BLE support for networking options.
There are also many connectors, too many to be listed here.
Their processor is the SiFive FU740-C000 operating at 1.2 GHz.
It has a dual-issue in-order 64-bit execution pipeline, four 64-bit SiFive U74 cores supporting RV64IMAFDC (or RV64GC for short) with 32KB instruction and data cache per core and a shared 2 MB coherent banked L2-Cache.

On the side of the soft-core implementations, there is the LiteX VexRiscv and the NEORV32.
The LiteX VexRiscv is an FPGA friendly 32-bit RISC-V CPU soft-core processor \cite{web-vexriscv}.
It can be used in a minimal configuration targeting RV32I in a no MMU (nommu) fashion, up to much more sophisticated configurations targeting RV32IMAFDC (RV32GC) with an MMU.
There are also debugging features and interrupt and exception handling with machine mode (M-mode) and optionally supervisor mode (S-mode) and user mode (U-mode).
Depending on the configuration and FPGA resources, its frequency can go as high 243 MHz.
Another soft-core implementation is the NEORV32, which is another FPGA friendly 32-bit RISC-V CPU implementation \cite{web-neorv32}.
Similarly to the LiteX VexRiscv it supports a range of different configurations.
It can run implementing only RV32I or RV32E, or extended to implement most extensions currently available, in addition to exception and interrupt handling.
Out of the RISC-V privileged specification it implements M-mode and U-mode.
Using RV32IMC with Zicsr and the Counters extension on an Intel Cyclone IV FPGA its maximum frequency is 128 MHz.
Both can come only as processor cores or in the package of an entire SoC.
%This flexibility is a useful property when considering how to get an operating system running by adding minimal extensions to the existing MiRiV implementation.

None of these existing implementations are trivial to adapt in order to extract a set of processor extensions that students can use for their MiRiV implementation to get an operating system running.
The challenge lies in the complexity of these projects, with many features being offered that are tightly integrated into the structure of the processor.
The NEORV32 for example is not a traditional pipelined processor but uses a multi-cycle architecture, while the LiteX VexRiscv uses a configurable pipeline with two to five stages, but is implemented using a Scala-based HDL called SpinalHDL.

%On the basis of this exploration, a decision was made to extend the MiRiV processor instead of using any existing implementation.

% CHAPTER
\chapter{Overview and analysis of the RISC-V ecosystem}
\label{chapter:overview}

As the centerpiece of this work is to connect the worlds of a basic RV32I hardware implementation and an operating system, two different aspects need to be considered simultaneously.
Which operating systems exist and what are their requirements?
Subsequently, which requirements are already implemented and which need to be implemented?

This malleable task where both the hardware architecture as well as the operating system and its configuration can be extended proves a difficult design challenge.
Designing the hardware to support the execution of an operating system is a bottom-up task where one needs to work at least initially towards a basic set of features beyond the already existing ones.
On the other hand, choosing and configuring the operating system is a top-down task where one works towards a reasonable configuration that offers sufficient capabilities together with a clear path towards building the hardware for it.

To avoid making a haphazard decision, we use a systematic approach towards comparing operating systems, their precise requirements and possible hardware extensions.
Our methodology for choosing an implementation path is to first consider RISC-V hardware extensions as they are standardized, in order to acquire an understanding of the complexity of different extensions that will be helpful during a subsequent analysis of the operating systems.
Afterwards, we analyze which operating systems are available to make a decision as to what operating system to use for our task.
To do this, we filter choices in two rounds.
In the first round, we roughly scan for operating systems with RISC-V support and what level of support is being offered, by using sources such as documentation.
This is done to gauge how large of an architectural change or porting effort would be necessary either in the MiRiV implementation or the operating system.
One such example is operating system support for 64-bit RISC-V processors only, which would incur larger architectural changes or porting efforts compared to a different operating system supporting 32-bit RISC-V processors.
After a preliminary choice of contenders for the second round, we analyze their minimal requirements by building the operating systems and looking at which extensions can be done without by configuring the build process.

In general, there are many possible extensions to make.
Within the scope of this thesis, they can be considered either strictly necessary, convenient, performance enhancing or expendable.
This categorization scheme will be elaborated upon in Section~\ref{section:categorization-scheme}.
Section~\ref{section:overview-ext} will describe in detail what (stable) RISC-V standard ISA extensions currently exist as part of the RISC-V specifications.
Section~\ref{section:overview-os} will outline possible operating system choices, configurations and which extensions they require.
Because there are too many operating systems for a more precise comparison, as part of the first round our preliminary choice of operating systems will be based on data gathered from documentation and similar sources and the decision process will be discussed in Section~\ref{section:making-a-choice}.
Finally in Section~\ref{section:os-ext-analysis} we will put everything together in an analysis of operating systems of choice as part of the second round, by classifying extension requirements in the above-mentioned categorization scheme.

To conclude this entire chapter, we will summarize our insights and single out an operating system that will be used as a target for the final implementation in Section~\ref{section:overview-conclusion}.

\section{Hardware Extension Categorization Scheme}
\label{section:categorization-scheme}

As we will need to decide which of the RISC-V standard ISA extensions and other hardware features to implement, it is necessary to understand what purpose they are meant to serve.
In general, extensions to the MiRiV processor can be either implementations for RISC-V standard ISA extensions, or hardware features that aren't explicitly specified but still necessary or otherwise beneficial to support the execution of an operating system.
Any extension can be said to fall within one of four categories:

\begin{itemize}
\item \textbf{strictly necessary}

Some hardware features such as support for exceptions, interrupts and trap handlers, in addition to several control and status registers specified in the unprivileged ISA extension Zicsr are likely unavoidable.

\item \textbf{convenient}

The hardware can be extended to implement conventional features like a memory-management unit (MMU) that are sometimes assumed by operating systems as a prerequisite but can be avoided with some effort (the MMU example is briefly touched upon in Section~\ref{subsection:memory-hierarchies}).

\item \textbf{performance enhancing}

Other extensions still can serve to improve various aspects of system performance.
The RISC-V C extension for instance can potentially greatly reduce code size, thereby relaxing not only memory constraints but perhaps even increasing instruction cache performance.
Other more complex features, such as out-of-order, superscalar and superpipelined execution can greatly increase processor performance, but aren't strictly necessary for supporting an operating system and would go against the goal of preserving the core pipeline as it is.
Hosting multiple cores also fits this class.
All of these are features that offer great opportunities for future work.

\item \textbf{expendable}

On the other hand, a floating-point unit (FPU) can be critical for some applications but expendable for others.
Several embedded systems forsake an FPU to drive down costs, which is entirely reasonable in case floating-point arithmetic can be avoided entirely.
In case floating-point operations cannot be done without, they can also be emulated in software\footnote{In the RISC-V GNU toolchain, the ABI \texttt{ilp32} assumes 32-bit software emulated floating-point operations (called soft-floats), while \texttt{ilp32f} and \texttt{ilp32d} are for 32-bit single-precision hard-float and 32-bit hard-float respectively.}.

\end{itemize}

\section{Overview of RISC-V extensions}
\label{section:overview-ext}

Capabilities of RISC-V processors which go beyond the base insteger instruction set are specified in both the RISC-V unprivileged \cite{riscv-isa-vol-1} and privileged \cite{riscv-isa-vol-2} specifications using individual and mostly decoupled extensions.
Many of these extensions are common features of modern processors, but often skipped in an educational context.
Every extension has trade-offs that are made quite deliberately and it is instructive to consider these extensions and their trade-offs in more detail.

\subsection{Zifencei Extension \cite[ch. 3]{riscv-isa-vol-1}}
\label{subsection:overview-zifencei-ext}

The base integer instruction set already includes a rudimentary fence instruction (\texttt{FENCE}).
As it is used to order I/O and memory accesses between different hardware threads, which becomes relevant when load and store operations are re-ordered, it is useful in a multi-core context, but also for single-core systems implementing interrupts.
The RISC-V unprivileged specification \cite[ch. 1]{riscv-isa-vol-1} introduces the term \textit{hart} to refer to a hardware thread, which is defined to be an abstract execution resource in the context of an execution environment and its relationship to a \textit{core} is such that a core might support one or more harts.

This extension serves to add another fence instruction (\texttt{FENCE.I}) with the following purpose:
It is a fence instruction that explicitly synchronizes writes to instruction memory and instruction fetches on the same hart.
According to the RISC-V unprivileged specification \cite{riscv-isa-vol-1}, "RISC-V does not guarantee that stores to instruction memory will be made visible to instruction fetches on a RISC-V hart until that hart executes a FENCE.I instruction."

%It is an general instruction applicable in many contexts, with fields for a source register, destination register and immediate bits, all of which are currently unused.
%These are reserved for possible future use in more specialized fence instructions.

\subsection{M Extension \cite[ch. 7]{riscv-isa-vol-1}}
\label{subsection:overview-m-ext}

The M extension adds new instructions to support integer multiplication and division.

The added multiplication instructions (\texttt{MUL} and \texttt{MULH[[S]U]}) have variations for signed and unsigned integer multiplication and also lower and higher result bits.
Signed and unsigned integer multiplication is necessary by virtue of RISC-V supporting both types of storing numbers and supported are signed $\times$ signed, unsigned $\times$ unsigned and signed $\times$ unsigned multiplications.
For integer multiplication, the resulting bits can exceed the source register size to double that amount, so instruction variations exist where the target register stores either the low or high 32 bits of the 64 bits resulting from the computation.

As for the division instructions, there are instructions for storing the quotient and remainder (\texttt{DIV[U]} and \texttt{REM[U]}), both using either signed or unsigned values.
Special care is needed to handle division by zero and in the case of signed divisions, an overflowing quotient when dividing the most-negative integer ($-2^{31}$) by $-1$.

One benefit of implementing the M extension is reduction in code size, as the compiler can simplify the code.
Without M extension support, the compiler has to resort to multiple addition and shift operations that are usually unrolled loops.
For more complex calculations (like modulo operators) the compiler might even insert auxiliary function calls.

\subsection{A Extension \cite[ch. 8]{riscv-isa-vol-1}}
\label{subsection:overview-a-ext}

The A extension adds new instructions to support atomic operations.
Atomic operations are meant to keep the data consistent when multiple processes read or change state.
For multi-core systems this is important because multiple cores could interact in an undesired fashion.
A classic example would be multiple cores accessing the same data, where non-atomicity could cause race conditions.
On single-core systems this issue might still arise in systems with preemption and multiple tasks.
For single instruction operations, there is no issue, but when atomicity needs to be ensured for multiple instructions interrupts could cause a context switch and lead to consistency issues.

Making multiple operations atomic in RISC-V is done by adding Load-Reserved/Store-Conditional (\texttt{LR.W} and \texttt{SC.W}) and atomic memory operation (AMO) instructions (\texttt{AMO(SWAP|ADD|AND|OR|XOR|MAX[U]|MIN[U]).W}).

Load-Reserved and Store-Conditional are instructions that can work in conjunction to ensure atomicity, thereby offering a way to synchronize hardware threads (harts) or ensure consistency on single-core systems.

Load-Reserved has a single source register and a destination register and reserves the memory address stored in \texttt{rs1}.
Store-Conditional has two source registers and a destination register and checks whether the reservation (at the memory address stored in \texttt{rs1}) is still held and stores the contents of \texttt{rs2}.
Depending on the value stored in the destination register after the Store-Conditional operation, the hart can check whether the atomic operation was successful.

The AMO instructions use a similar format, but with different semantics.
They atomically load a value from the address stored in \texttt{rs1}, placing the value into register \texttt{rd} in addition to applying a binary operator to the loaded value and the value stored in \texttt{rs2} and finally storing the result back to address the address in \texttt{rs1}.
These binary operators are a swap operation, integer add, bit-wise operations (\texttt{AND}, \texttt{OR}, \texttt{XOR}) and signed/unsigned integer maximum and minimum.

\subsection{Zicsr Extension \cite[ch. 9]{riscv-isa-vol-1}}
\label{subsection:overview-zicsr-ext}

The Zicsr extension serves to add an address space of 4096 control and status registers (CSRs) and a set of instructions that operate on these CSRs.
To be precise, there is a 12-bit encoding space to support up to 4096 CSRs.
Their addresses, encoding, purpose and convention is described extensively in the RISC-V privileged specification \cite{riscv-isa-vol-2}.
To give a quick overview, there are unprivileged and user-level CSRs, supervisor-level CSRs, hypervisor and VS (virtual supervisor) CSRs, machine-level CSRs, unprivileged floating-point CSRs and finally, unprivileged counters and timers.
The Zicsr extension itself does not necessarily concern itself with the assignment of CSRs but solely how instructions can interact with them.

There are six instructions added as part of the Zicsr extension, alongside seven pseudo-instructions, six pseudo-instructions related to counters and nine additional pseudo-instructions related to floating-point registers.
The six instructions semantically consist of three instructions, with source register and immediate variants.
In essence, the instructions atomically read the value of a CSR.
Additionally, they write it to the destination register and write new contents into the CSR, although CSRs can be set up to be read/write or read-only.
The first two instructions (\texttt{CSRRW[I]}) are meant to transfer the value of a source register (or an immediate value) to a CSR, while writing the previously held value of the the CSR to the destination register.
The next two instructions (\texttt{CSRRS[I]}) similarly read the value of the CSR and update the CSR, but they use the value in the source register (or an immediate value) as a bit mask instead.
Masked bits are set high in the CSR, while the previously held CSR value is written to the destination register.
The final two instructions (\texttt{CSRRC[I]}) have the same semantics, but they clear bits instead.
Depending on whether the zero register is used as a source or destination register, reads and writes might explicitly not occur.
Because there are many different possible operations, many pseudo-instructions based on these six instructions are offered.

\subsection{Counters Extension \cite[ch. 10]{riscv-isa-vol-1}}
\label{subsection:overview-counters-ext}

The Counters extension (sometimes, but not officially, referred to as Zicntr) is meant to encapsulate all that is necessary to support the operation of hardware performance counters and timers.
Compared to the other extensions, this extension is not actually ratified yet and is subject to change.
It does not actually add any new instructions, but instead offers three pseudo-instructions based on Zicsr (\texttt{RDCYCLE[H]}, \texttt{RDTIME[H]} and \texttt{RDINSTRET[H]}, all based on \texttt{CSRRS}).
This is the case because the performance counters and timers themselves are implemented as read-only CSRs.

For purposes of supporting a higher level of precision, counters and timers are 64-bit CSRs, even on 32-bit architectures.
This is made possible by splitting up the 64-bit values into two separate CSRs.
The pseudo-instructions that have an \texttt[H] suffix read bits 63 to 32.

The \texttt{RDCYCLE[H]} pseudo-instruction reads the CSR \texttt{cycle} which holds a count of the number of clock cycles executed by the processor core on which the hart is running.
The \texttt{RDTIME[H]} pseudo-instruction reads the CSR \texttt{time} which holds the wall-clock real time.
The \texttt{RDINSTRET[H]} pseudo-instruction reads the amount of instructions that were retired by a hart from the CSR \texttt{instret}.
All of these counters count from some arbitrary point in the past.

This conflicts slightly with the RISC-V privileged specification \cite{riscv-isa-vol-2}, which does not specify \texttt{time} as a CSR, but rather two 64-bit memory-mapped registers \texttt{mtime} and \texttt{mtimecmp}.

Additional CSR space is available for hardware performance counters, which can be used for profiling by counting platform-specific events.
They have similar semantics.

\subsection{F, D and Q Extension \cite[ch. 11-13]{riscv-isa-vol-1}}
\label{subsection:overview-fdq-ext}

There exist three separate extensions, all meant to incorporate IEEE 754-2008 floating-point support.
The F extension is meant for single-precision floating-point support, while the D extension targets double-precision and the Q extension quad-precision.
They are set up in a way that implicitly assumes implementation of Zicsr and one another, with the F extension requiring the Zicsr extension, the D extension requiring the F extension and the Q extension requiring the D extension.

The F extension adds a new set of 32 floating-point registers and a CSR that collects accrued exceptions and the rounding mode.
There are plenty of new instructions added to support all kinds of floating-point operations, going well beyond the arithmetic operations of the base integer instruction set.

\subsection{C Extension \cite[ch. 16]{riscv-isa-vol-1}}
\label{subsection:overview-c-ext}

The C extension is meant to add compressed instructions to reduce static and dynamic code size.
This is achieved by trimming common operations to 16-bit instructions.
As such, instructions can now start on any 16-bit boundary.
Many short hand versions of pre-existing instructions are added.

\subsection{Machine-Mode (M-mode) \cite[ch. 3]{riscv-isa-vol-2}}
\label{subsection:overview-m-mode-ext}

This is the first extension listed here that is not part of the RISC-V unprivileged specification, but rather the RISC-V privileged specification, which describes different privilege modes with their respective specified functionality.
It adds several machine-level CSRs as well as Trap-Return and Interrupt-Management instructions.
For M-mode, these instructions are a \texttt{MRET} (M-mode trap return) and \texttt{WFI} (wait for interrupt) respectively.

The CSRs which are added range from read-only CSRs that supply system information to others that control processor behavior.
Examples for this are the \texttt{misa} register, which encodes what parts of the modular RISC-V ISA are implemented and supported by the processor.
Other important CSRs are \texttt{mie} (interrupt-enable register), \texttt{mip} (interrupt pending register) \texttt{mtvec} (trap-handler base address register), \texttt{mstatus} (machine status register), \texttt{mepc} (exception program counter register) and \texttt{mcause} (trap cause register), to list a few.

Not every CSR listed has to be implemented and this applies to all privilege modes.
This is because operating systems do not necessarily make use of all of them.
For some of them, the operating system can still operate without them being implemented if they return default values.

\subsection{Supervisor-Mode (S-mode) \cite[ch. 4]{riscv-isa-vol-2}}
\label{subsection:overview-s-mode-ext}

This extension serves to add optional supervisor-level counterparts of the M-mode extension for the lower S-mode privilege level.
We have seen embedded Linux using glibc make use of this extension.
In general, S-mode is meant for operating systems, while M-mode offers low-level access to the hardware platform.

More instructions are added, which are not only variations of M-mode counterparts.
There is an analogous \texttt{SRET} instruction, but supervisor-level variations of the fence instructions are also added (\texttt{SFENCE.VMA}, \texttt{SINVAL.VMA}, \texttt{SFENCE.W.INVAL} and \break\texttt{SFENCE.INVAL.IR}).

One would need to implement S-mode to support paging.

\subsection{User-Mode (U-mode)}
\label{subsection:overview-u-mode-ext}

The U-mode privilege level is mentioned throughout the RISC-V privileged specification \cite{riscv-isa-vol-2}, but is not as extensively specified as M-mode and S-mode.
This makes sense, because U-mode is intended as the \emph{least privileged} mode and serves to create a boundary to the next privilege mode in order to protect the system from application code.

Valid systems adhering to the privileged specification must have at least M-mode implemented, but it is optional whether to also implement S-mode or U-mode.
As a result, possible configurations are \texttt{M} (simple embedded systems), \texttt{M + U} (secure embedded systems) and \texttt{M + S + U} (systems running Unix-like operating systems), as listed in Table 1.2 of \cite{riscv-isa-vol-2}.

\subsection{Hypervisor-Mode (H-mode) \cite[ch. 8]{riscv-isa-vol-2}}
\label{subsection:overview-h-mode-ext}

This extension implements another privilege level in addition to the other privilege levels.
It is meant for execution environments in which a hypervisor or virtual supervisor is used.

\subsection{Memory Management Unit (MMU)}
\label{subsection:overview-mmu-ext}

Also to be considered a hardware extension is the memory management unit.
This differs from the other hardware extensions in that the MMU is not defined in either of the RISC-V specifications.
But for certain operating systems, having an MMU is required.
We will consider the details in a later part after getting an idea of operating systems, in Section~\ref{subsection:memory-hierarchies}.

\section{Overview of Operating Systems}
\label{section:overview-os}

In order to realize the goal of supporting an operating system, having a precise set of requirements is necessary.
Without comprehensively looking at available options and making an initial choice, it is not possible to focus on this goal.

What follows in this section is an overview that will list operating system choices for RISC-V, details what they offer and require and finally we will make a choice about which one to pick.

As the goal is to offer students something to tinker with for their own understanding, the more features can be offered the better, although this has to be contrasted against operating system complexity to keep the implementation effort feasible.

%\subsection{Available Operating Systems}

RISC-V has certainly gained traction since its inception as a research and teaching ISA.
One proof of this is the wide range of operating systems that do in some way support it.
We will take a quick look at the current state of available operating systems.

There are roughly three classes of operating systems that are available right now for RISC-V processors.
Sometimes software running on RISC-V processors is classified into bare-metal, real-time operating systems and rich operating systems \cite{web-riscv-supporting-an-os}.

Bare-metal software might not qualify as an operating system, but as a category it captures low-level software that does not fall under any of the other two categories and is often a primary target or prerequisite for prototypes due to its relative simplicity.
Although operating systems themselves would qualify as bare-metal software, this term is still used to refer to software which directly takes control of the underlying hardware and does not introduce an additional abstraction layer that fulfills the purposes of an operating system.

A rich operating system is geared more towards enabling many different kinds of purposes without focusing too much on a particular aspect.
For example, they might be used in a lightweight laptop, a powerful desktop workstation or a server, each with their own sets of typical software applications.

Real-time operating systems on the other hand might be considered restricted operating systems, with the specific purpose of fulfilling real-time constraints.
Other special purpose serving operating systems exist, like for example seL4 \cite{web-sel4}, a security-driven but fast microkernel platform.
These other special purpose serving operating systems do not see themselves as a traditional real-time operating system (but might still be real-time, like SAFERTOS \cite{web-riscv-supporting-an-os}) and will be considered separately.

\subsection{Real-Time Operating Systems (RTOSs)}

This class of operating systems is often used to run on microcontrollers that are small, resource-constrained but sometimes mission-critical.
They strike a delicate balance of supporting enough to serve their purpose and keep necessary guarantees but little enough to drive costs low.
Several RTOSs that have been ported to RISC-V devices exist, but their software ecosystems have varying degrees of support for the RISC-V platform as a whole.

To the best of our knowledge, the two best supported RTOSs for RISC-V right now, particularly for the given constraints, are FreeRTOS \cite{web-freertos} and Zephyr RTOS \cite{web-zephyr}.
They don't only offer ports for specific RISC-V processors, but provide comprehensive documentation on how to utilize and adapt them for new ones as well.

FreeRTOS has a dedicated page, among other entries, detailing the steps to get FreeRTOS on a RISC-V platform \cite{web-freertos-riscv}.
Another advantage is that FreeRTOS is configurable enough to support the operation of RISC-V processors with varying extension support.
There are even features to configure it out of the box for processors with different exception and interrupt unit architectures, like for example a SiFive core local interruptor (CLINT)~\cite{sifive-interrupt-cookbook}, and architectures lacking hardware timers.
A lack of hardware timers refers to not implementing \texttt{mtime} and \texttt{mtimecmp} out of the RISC-V privileged specification \cite{riscv-isa-vol-2}.

The other contender is Zephyr RTOS, which is a Linux Foundation project that serves as a software platform for many different microcontrollers.
There is a wide array of actual hardware boards that are supported (18 as of yet!) \cite{web-zephyr-riscv-boards}.
Previously mentioned RISC-V processors, namely the SiFive FU740-C000, LiteX VexRiscv and NEORV32, are supported for example.

There are two more RTOSs that have ports for RISC-V, but lack documentation and only support the operation of highly specific RISC-V processor configurations, none of which are minimal.
This is reasonable, because they target boards that offer a certain minimum of features, but they are hard to deal with for the purposes of this work.
These RTOSs are Huawei LiteOS \cite{web-huawei-liteos} and Apache Mynewt \cite{web-apache-mynewt} respectively.
FreeRTOS and Zephyr RTOS, currently offer much richer flexibility, support and documentation.

\subsection{Rich Operating Systems}

Another class of operating systems exists that has a wider range of available features and customizability.
It encompasses all kinds of use cases, from a desktop computer all the way to a high-performance computing context, with many distributions existing to target a specific set of constraints but otherwise offering a general solution.
Several popular operating systems belong to this class, but don't currently offer any \mbox{RISC-V} support.
Linux however, by way of its open ecosystem as well as active community, does offer support.
For this reason, but also for interoperability and accessibility, we will restrict our analysis of rich operating systems and what it takes to support them to Linux.

There are many ways to utilize Linux as an operating system.
Linux by itself is only a limited component in what constitutes a whole operating system and is usually complemented by additional software.
Because there are so many ways in which Linux can be used, we want to look at Linux configurability and differentiate between full-fledged Linux distributions and embedded Linux.

\subsubsection{Minimal Linux Configuration}
\label{subsubsection:linux-configurations}

Linux is an operating system that serves many different purposes.
It comes in many different forms and each has its own advantages and disadvantages.
By virtue of its completely free and accessible nature, anyone can adopt Linux to fit their needs and many of the distributions we have looked at exist as adaptations.

%Usually Linux is used in a server or desktop configuration by making use of readily available Linux distributions which not only build and distribute Linux, but also offer vast package repositories, automation, documentation, and guidance.
%This is typically very useful when adopting Linux for such platforms, but for embedded platforms the situation is quite different.
%Some of the goals of server or desktop Linux run contrary to the goals of embedded Linux.
%Instead of prepackaging as many programs and features that a user might need, the distribution needs to be kept as small and compact as possible.
%User-friendliness is also not as important.

%All of these different use cases for an operating system like Linux and many more are supported in some way.
%Users of Linux have made sure that it runs on a breathtaking amount of platforms, bridging what seems like deal-breaking differences.
%One such example is support for many different ISAs and microarchitectures, RISC-V and some of its live hardware implementations on actual boards included.
%The most sophisticated server platforms offer complex SIMD instruction support, advanced virtualization features and other performance and security innovations.
%By contrast, Linux runs on those platforms, just like it does on others that don't even offer virtual memory support by lack of an MMU.

A minimal Linux configuration is hard to achieve.
In the first place, it is not clear what the lowest possible requirements really are.
Additionally, some configurations have more support available, with respect to build tools, hardware platforms and general software support.
In theory Linux can support any system, but the question is always at what cost.
The most prevalent uses for Linux have led to an ecosystem in which using Linux for precisely such purposes is easy.
The most obscure uses suffer from their prerequisites to dig deeper, tinker around and making far reaching customizations.

Configuration has to be differentiated between kernel configuration and user space configuration.

\subsubsection{Linux Auxiliary Software}
\label{subsubsection:linux-aux-software}

Linux is only a kernel.
To transform Linux into an operating system that doesn't only perform the operations of a kernel, some sort of user space must be added.
If code is executed only by the kernel it is said to live in kernel space.
Likewise, code that is executed by a user process is said to live in user space.
This separation exists for memory and hardware protection purposes and comes from the fact that code inside virtual memory can be considered to belong to only one of these two spaces.

Besides using Linux as a kernel, different options exist for the user space.
It is primarily provided by way of the C standard library and the applications that come with the operating system itself.
Auxiliary software refers to these additional components of a Linux based operating system

Historically there have been projects that attempt to bring down the size and requirements of Linux, like uClibc \cite{web-uclibc} (the recent version being uClibc-ng), musl \cite{web-musl}, busybox \cite{web-busybox} and buildroot \cite{web-buildroot} to list a few.
uClibc represents a basic C standard library for more minimal purposes than that of the de facto standard for Linux distributions \cite{web-linux-standard-clib}, glibc, developed by the Free Software Foundation.
The other C standard library that is an alternative to glibc is musl, which has the goals of being lightweight, fast, simple, free and correct.
The busybox user space is an alternative to the more resource-intensive GNU user space.
Finally, buildroot serves as a convenient tool to generate embedded Linux systems through cross-compilation.
The projects uClibc, busybox and buildroot are projects that have been developed in conjunction, as they complement each other.
As an example, a Linux system can use uClibc with a busybox user space, built using the buildroot build system.

We will compare configurations using different C standard library implementations and their support, as well as configurations that do or do not use an MMU.
To refer to the lack of an MMU, we will use the term nommu.

\subsubsection{Full-fledged Linux Distributions}
\label{subsubsection:full-fledged-linux}

Several of the major Linux distribution projects have opted to support RISC-V, with varying levels of support.
One of the major distributors of Linux is the Debian project \cite{web-debian}, which not only makes a wealth of Linux distributions for all kinds of targets, but whose work is reflected downstream in other distributions which base themselves on it.
One such distribution is Ubuntu \cite{web-ubuntu}.

Debian has begun working on the RISC-V port many years back and its support has matured by now for several available devices \cite{web-debian-riscv}.
Besides the more low level and lightweight embedded Linux systems we will discuss in a brief moment, Debian is the only full-fledged Linux distribution that has any support for 32-bit RISC-V.

Ubuntu also supports RISC-V, but it is more experimental and limited to 64-bit RISC-V \cite{web-ubuntu-riscv}.

Another family of Linux distributions (the families itself are determined by how software made available to users of the distribution is packaged) is Fedora Linux \cite{web-fedora}.
Its derivatives include Red Hat Enterprise Linux and CentOS.
Fedora also has support for 32-bit RISC-V and some available devices \cite{web-fedora-riscv}.

Yet another distribution supporting 32-bit RISC-V is OpenSUSE \cite{web-opensuse}.
While it does offer support, as of yet its support is largely experimental \cite{web-opensuse-riscv}.

There is also Alpine Linux \cite{web-alpine}, which considers itself a small, simple and secure Linux distribution.
It makes extensive use of more lightweight alternatives to common software packages and thus enables more limited hardware platforms to perform well.
This makes it a good base for containerized applications, as containers strive to run on a base system that is as minimal as possible.
It would also be a great fit for the use case of this project and the RISC-V platform in general.
But although some work was already put into porting it to and supporting RISC-V, none of it has been finalized and there is no official documentation besides experience reports of a few users \cite{web-alpine-riscv}.

\subsubsection{Embedded Linux Distributions}
\label{subsubsection:embedded-linux}

Besides full-fledged Linux distributions, there exist also smaller projects meant to target more specific embedded architectures.
Users are always free to build their entirely own Linux distribution for their device, gaining complete customizability but also inheriting the workload associated with it.
Because this is not productive if everyone repeats such efforts, certain projects have attempted to streamline the process of building an embedded Linux distribution by offering sophisticated build systems.
An important difference between these projects and aforementioned Linux distributions is that they are not themselves Linux distributions, but build systems that can generate Linux distributions for many different embedded devices.

One such build system is buildroot \cite{web-buildroot}, which offers multiple ways for users to configure all aspects of the resulting Linux system.
Buildroot itself has support for RISC-V in several different configurations.
Extension support can be toggled, different ABIs can be targeted, target architecture size can be configured, MMU support (in general buildroot tries not to assume whether a system supports virtual memory) can be disabled and more \cite{web-buildroot-riscv}.

Another build system is the Yocto Project \cite{web-yocto}.
It exists for many of the same reasons and is sometimes compared directly against buildroot \cite{web-buildroot-vs-yocto}.
The Yocto Project uses a more modular approach and users can supplement their own compatibility layers.
These layers define all necessary configurations and recipes for users to build their own system.
Through one of these layers, the Yocto Project supports RISC-V \cite{web-yocto-riscv}.

\subsection{Special Purpose Operating Systems}

These systems form the last class of operating systems.
They are neither real-time operating systems (which are well established as a class of their own), nor one of the rich operating systems.
They don't base themselves on any of the pre-existing rich operating systems, having been built from the ground up to satisfy certain constraints to the best of their ability.
Such constraints might be code size, performance or security.
But they also serve as experimental platforms to test unique innovations on.

Because of their reduced scope, they are often simpler and thus more easily ported to new architectures.
Three such operating systems are the seL4 Microkernel \cite{web-sel4}, Harvey OS \cite{web-harveyos}, as well as HelenOS \cite{web-helenos}.
Each of them has seen efforts to extend their support to RISC-V, but with varying degrees of success.
The seL4 Microkernel as a selling point offers functional correctness proofs for hardware architectures it supports, but this has only been done for a single device that hosts a 64-bit RISC-V processor \cite{web-sel4-riscv}.
Harvey OS also has new configuration options to support RISC-V, but the full extent of its support is unclear \cite{web-harveyos-riscv}.
Lastly, HelenOS has also been ported to RISC-V, but as of yet there is no official documentation to go by \cite{web-helenos-riscv}.

\section{Making a Choice}
\label{section:making-a-choice}

An ever-present theme with many of these operating systems and their support for RISC-V is their varying degree of quality with respect to their documentation and device support.
This is only natural, as RISC-V is still relatively new and the platform as a whole is not yet mature.
While there is now quite a range of actual hardware available, RISC-V itself strives to be modular and support as many hardware configurations as possible.
This is one of the strengths, but simultaneously a weakness, as it fragments the ecosystem and makes supporting the RISC-V architectures more difficult for maintainers of software.

At this stage, much of the support for RISC-V by operating systems is experimental and exists to serve as a proof of concept.
Some of it is relatively mature however, particularly those projects that have invested more effort and started sooner.

For the scope of this thesis we cannot take a close look at all of these operating systems, so there is a choice to be made even before beginning to adapt and possibly extend an operating system insofar as necessary.
This choice also needs to align with the goals stated in Chapter~\ref{chapter:introduction}.

Looking at the real-time operating systems, out of all the choices, only two of them are at a state appropriate for use on RISC-V, FreeRTOS and Zephyr RTOS.
The other options have not yet matured.
Both FreeRTOS and Zephyr RTOS on the other hand can feasibly be ported to run on the MiRiV processor.

Although rich operating systems would be best suited for rich demonstrations of processor capabilities, they are more geared towards platforms that have support for common RISC-V extensions such as the multiplication, atomics, and floating-point extensions, more sophisticated memory hierarchies and additional performance enhancing extensions.
Particularly with full-fledged Linux distributions, for reasons of modernity and practicability, a 64-bit architecture is often assumed.
This stands in direct contrast with what is currently available with the MiRiV processor, so these operating systems will not be considered further.

As for the embedded Linux distributions, their flexibility allows them to cater to a great wealth of hardware platforms and utilizing them for the MiRiV processor is more feasible.
Between buildroot and Yocto Project, buildroot currently has more accessible documentation and is by itself more accessible because of its focus on simplicity \cite{web-buildroot-vs-yocto}.
Yocto Project, designed as a powerful tool to support powerful abstractions, comes at the price of understanding its abstractions and the tools it relies on.
Buildroot on the other hand re-uses common development tools for its build system and is constrained in complexity, making it easier to use and easier to get familiarized with.

For our purposes it is not necessary to streamline the entire process of creating elaborate Linux distributions, packaging a lot of software and enabling complex release cycles where only parts can be exchanged, which would all be possible with Yocto Project.
Because buildroot supports some of these features but in a simpler fashion, not much is lost in choosing buildroot.
As a result, we will restrict our analysis to Linux generated by the buildroot build system only.

The last class of operating systems, special purpose operating systems, would typically offer a good opportunity for processor designs like MiRiV.
Their relative obscurity and lack of extensive documentation make porting them difficult however.
They also do not exhibit as big and vibrant of an ecosystem as specifically Linux does and are more limited in their uses.
While it would be interesting to analyze them further, they will not be considered for the purposes of this thesis.

The operating systems that will thus be examined more closely are FreeRTOS, Zephyr RTOS and embedded Linux (generated using the buildroot build system).
All of them can target 32-bit platforms, like the MiRiV processor, and they will be used as such.
This is not coincidental of course, the choices were deliberately made to support the existing 32-bit design.
While a change of the design to 64-bit would be possible, the end result would not fit our purposes well, as these extensions are intended for a lab course in which 32-bit designs have to be implemented.

%For both FreeRTOS, Zephyr RTOS and embedded Linux, we will take a closer look at their extension requirements in Section~\ref{section:os-ext-analysis}. But before we do, we will take a closer look at Linux configurations and auxiliary software.

\section{Analyzing Operating System Requirements}
\label{section:os-ext-analysis}

To implement only what is necessary, we will need to have a clear picture of the extension requirements of all the operating systems of choice.
This is not exactly as trivial as looking at requirement lists, because the operating systems themselves support many different configurations and could be adapted still to require fewer or more extensions.
We will first make builds of each operating system to get actual binaries that will later have to run on the target device.
These builds can be examined more closely using a disassembler to see which instructions are contained and thus find out the extension requirements.

\subsection{Memory Hierarchies}
\label{subsection:memory-hierarchies}

Before looking at operating system builds however we will make a quick excursion to memory hierarchies.

Advanced processors have advanced memory hierarchies, supporting multiple layers of memory.
Beyond the cache lies actual memory, but because memory layers that lie outside the processor often have multitudes of higher memory latencies, sophisticated systems were invented to lessen the resulting performance penalty.
Aspects other than performance are also important, such as memory protection and code relocation.
A memory hierarchy consisting of multiple layers of caches, together with a virtual memory system on top, is found in many different commercially available processors.

Virtual memory has many different advantages and disadvantages.
In essence, it serves to create virtual memory regions (memory pages) that are addressed using virtual addresses.
These virtual memory regions map to physical memory regions, but require a system for address translation.
This system comes in the form of a memory management unit.

Modern operating systems run on modern processors and thus make extensive use of these capabilities.
But besides interrupt and exception handling to process events (such as page faults, which occur when a memory page is not currently loaded in main memory), virtual memory support requires dedicated address translation hardware and setting up an entire memory hierarchy.

Even embedded Linux configurations (such as buildroot 2022.08 build we will soon take a look at) make use of virtual memory, as MMU support by processors becomes ever more ubiquitous.
But some microcontrollers do not contain an MMU and for this reason, some operating systems still target systems without virtual memory.
All RTOSs as well as the nommu embedded Linux configuration (using a forked buildroot 2022.05) we will look at run on systems without an MMU.

Without an MMU, code organization becomes more difficult:
In Linux, programs are usually compiled into the ELF binary format, which is a format used to store position-independent code.
When executing ELF executables, process-specific virtual memory is created, executable code and data sections are copied and the program counter is set to the entry point address.
Without virtual memory support, an application has to be loaded at fixed respective memory addresses because parts of it will expect that code to be there.

RTOSs circumvent this issue by taking all kernel and user space code and compiling them together into a single binary, organized in a way where no such issues can arise.

The embedded Linux build uses uClibc, which supports a different binary format, the binary flat (bFLT) format \cite{web-uclibc-bflt}.
This format is a descendant of the a.out format.
The original uClibc project supplies a tool called elf2flt that converts ELF to bFLT for nommu Linux targets \cite{web-uclibc-elf2flt}.
This is done by relocating sections and changing references to absolute addresses.

Implementing an MMU is not a small effort.
For this reason we are more interested in trying to run basic operating systems without having to implement an MMU and the rest of virtual memory.

\subsection{Build Systems}

The builds themselves have to be made using a RISC-V toolchain like the RISC-V GNU toolchain \cite{web-riscv-gnu-toolchain}.
This includes software tools like the GCC, the GNU Compiler Collection, configured to target RISC-V.
The disassembler itself is also already part of this toolchain.
Because RISC-V as an ISA has a modular design, this toolchain can be configured to target different target architectures with varying degrees of support for extensions.
Of course a compiler that already only targets the base integer instruction set won't generate any of the instructions that are part of any additional extension (with the exception of extensions like Zicsr).
But because the operating systems' build processes themselves sometimes assume special toolchain configurations, the builds will be created using as minimal a toolchain configuration as is possible.
The result will then be examined to see which extensions, if any, are actually required.

Furthermore, to examine a more lightweight embedded Linux configuration, we will make use of unofficial buildroot patches.
At the time of writing, buildroot (version 2022.08) only supports a limited set of embedded Linux configurations targeting RISC-V.
Going lower with the requirements would require patching buildroot and Linux, which is not a trivial effort.
Others have done this and as such, we will also consider an embedded Linux system produced by a forked buildroot that enables us to use uClibc in a nommu configuration.
One buildroot fork by Damien Le Moal has enabled buildroot to target a nommu configuration for 64-bit RISC-V embedded Linux \cite{web-buildroot-fork-damien-lemoal}.
GitHub user regymm has extended this effort to support 32-bit systems \cite{web-buildroot-fork-regymm}.
It is this second fork that targets 32-bit systems we will be using.

FreeRTOS is straightforward when it comes to build systems.
The operating system is offered as a set of sources and sample projects that can be adopted with their respective build systems, which often use Makefiles.
One could adopt the sources and integrate them into an existing build process, which makes FreeRTOS attractive for the purpose of running it on the MiRiV processor.

Zephyr RTOS is intended to be used in conjunction with its sophisticated build system which makes extensive use of CMake and a meta tool called West.
Getting Zephyr RTOS to support the MiRiV processor would also require creating respective hardware definitions in the format of DeviceTree sources.
Because such additional integrations are not necessarily a straightforward task and go against the previously stated goal of getting to a state where an operating system can be run on the MiRiV processor within the scope of a bachelor thesis, FreeRTOS will be preferred over Zephyr RTOS.
Thankfully due to existing support to similarly equipped RISC-V processors, we can still at least include Zephyr RTOS in our analysis.

Consequently, the operating systems in consideration are FreeRTOS, Zephyr RTOS and embedded Linux (the first configuration built using buildroot version 2022.08, the second using the buildroot fork version 2022.05 with the nommu patch).
In the embedded Linux systems, a busybox shell session is executed as the root process (no change in preconfigured default behavior when building embedded Linux using buildroot).
FreeRTOS is set up to make use of the pre-existing MiRiV UART software infrastructure, executing two threads that output to UART periodically.
Zephyr RTOS is set up to use the NEORV32 board configuration (mentioned earlier), which is itself a minimalistic soft-core 32-bit RISC-V SoC, with the same application as on FreeRTOS tweaked to use Zephyr RTOS libraries \cite{web-neorv32}.

This is acceptable for comparison purposes because of the way Zephyr RTOS is set up to produce a final build.
The DeviceTree specifications exist in order to be able to include code for features for peripherals, like for example UART or SPI.
When such features are used, code is included that is specific to the targeted hardware specification.
In the application running on FreeRTOS and Zephyr RTOS, only threading features to run two threads on a single core are used, the pre-existing MiRiV UART interface and library are re-used, thus not resulting in the inclusion of any NEORV32 specific code.

\subsection{Analysis Results}
\label{subsection:os-analysis-results}

To achieve the results that follow in this section, the four operating systems were built in configurations described in the previous section on build systems and the resulting binary files disassembled using \texttt{objdump} of the RISC-V GNU toolchain \cite{web-riscv-gnu-toolchain} to extract instruction counts and other relevant data.

\begin{table}[ht]
    \centering
    \begin{tabularx}{\textwidth}{r|cccc}
        & \multicolumn{4}{c}{OS configuration} \\
        \hline
        & & & buildroot 2022.08 & buildroot 2022.05 \\
        standard & FreeRTOS & Zephyr RTOS & Linux 5.17 & Linux 5.18 \\
        extensions & & & (glibc + busybox) & (uClibc + busybox) \\
        \hline
        M        & \catPE & \catPE & \catPE & \catSN \\
        A        & \catEX & \catEX & \catSN & \catSN \\
        Zifencei & \catEX & \catEX & \catSN & \catSN \\
        Zicsr    & \catSN & \catSN & \catSN & \catSN \\
        Counters & \catEX & \catEX & \catSN & \catEX \\
        F, D, Q  & \catEX & \catEX & \catSN & \catPE \\
        C        & \catPE & \catPE & \catPE & \catPE \\
    \end{tabularx}
    \caption{\label{table:rv-unpriv-classification}Classification of RISC-V unprivileged ISA standard extensions \cite{riscv-isa-vol-1} with respect to different operating systems (Legend: \catSN = strictly necessary, \catPE = performance enhancing, \catEX = expendable)}
    \vspace{1ex}
\end{table}

Table~\ref{table:rv-unpriv-classification} displays all sampled requirements of all the given operating systems in terms of RISC-V unprivileged specification ISA extensions.
Only stable ISA extensions were considered, as other extensions are currently in the process of being standardized and have yet to reach a stable state.
The aforementioned categorization scheme (Section~\ref{section:categorization-scheme}) using four categories is used to delineate each of the extensions that can be made and what category they fall under.
Likewise, Table~\ref{table:rv-priv-classification} shows whether auxiliary units such as an MMU or any of the RISC-V privileged specification ISA extensions are necessary.

\begin{table}[ht]
    \centering
    \begin{tabularx}{\textwidth}{r|cccc}
        & \multicolumn{4}{c}{OS configuration} \\
        \hline
        & & & buildroot 2022.08 & buildroot 2022.05 \\
        privilege & FreeRTOS & Zephyr RTOS & Linux 5.17 & Linux 5.18 \\
        modes & & & (glibc + busybox) & (uClibc + busybox) \\
        \hline
        M-mode & \catSN & \catSN & \catSN & \catSN \\
        S-mode & \catEX & \catEX & \catSN & \catCO \\
        U-mode & \catEX & \catEX & \catEX & \catEX \\
        H-mode & \catEX & \catEX & \catEX & \catEX \\
        \hline
        MMU    & \catEX & \catEX & \catSN & \catCO \\
    \end{tabularx}
    \caption{\label{table:rv-priv-classification}Classification of RISC-V privileged ISA privilege modes and whether they require an MMU \cite{riscv-isa-vol-2} with respect to different operating systems (Legend: MMU = memory management unit, M-mode = machine mode, S-mode = supervisor mode, H-mode = hypervisor mode, \catCO = convenient, \catSN = strictly necessary, \catEX = expendable)}
    \vspace{1ex}
\end{table}

\begin{figure}[ht]
    \centering
    \includegraphics[width=.95\textwidth]{os-base-histogram.png}
    \caption{Base integer ISA instructions across systems grouped by respective subcategories.}
    \label{fig:os-base-histogram}
\end{figure}

\begin{figure}[H]
    \centering
    \includegraphics[width=.95\textwidth]{os-ext-histogram.png}
    \caption{Instructions compared across systems grouped by their classification in the RISC-V unprivileged and privileged specifications, excluding base integer ISA instructions grouped in Figure~\ref{fig:os-base-histogram}.}
    \label{fig:os-ext-histogram}
\end{figure}

Figure~\ref{fig:os-base-histogram} is a comparison between all considered operating systems and their base integer instruction distribution within the disassembled linked binary.
Not only is the total count of base integer instructions displayed, but they are also grouped by different categories.
The scale was chosen to be logarithmic for the categories to be comparable to one another.
The category for arithmetic instruction include those that operate on three registers, as well as their immediate instruction form counterparts.
Branch and jump instructions are in two separate categories.
The U format (\texttt{LUI}, \texttt{AUIPC}) instructions were included in the category for jump instructions, as they are commonly used for subsequent jumps.
Load and store instructions are also in their own category.
Finally, there is a category for system instructions, which groups together those which are part of the base integer instruction set but do not belong in any of the aforementioned categories (\texttt{ECALL}, \texttt{EBREAK}, \texttt{FENCE}).

Figure~\ref{fig:os-ext-histogram} does not look at any of the base integer instructions, but instead considers all instructions which are part of some extension.
The Counters extension doesn't introduce any new instructions, but re-uses CSR instructions introduced by the Zicsr extension.
Still, it is possible to detect their use or label them using their respective pseudo-instructions (\texttt{rdinstret[h]}, \texttt{rdcycle[h]}, \texttt{rdtime[h]}).

As can be seen, the real-time operating systems compare favorably against embedded Linux configurations in terms of code size and required complexity by having a much smaller footprint and requiring markedly less extension support.
These are desirable properties to have when not all the complex features of the extendable instruction set can be implemented.

The embedded Linux systems practically assume the A extension, but profit from an implementation of other extensions.
When targeting a system with an MMU, S-mode (supervisor mode) instructions are necessary as well.
Not surprisingly, none of these lightweight systems contain any H-mode (hypervisor mode) instructions.
The Zicsr extension is strictly necessary for all operating systems.

FreeRTOS uses the least amount of distinct CSRs (mcause, mepc, mhartid, mie, mstatus and mtvec).
Zephyr RTOS makes use of the same CSRs, but uses additional ones as well (mcountinhibit, mcycle, mcycleh, minstret, minstreth and mtval).

As for Linux (glibc + busybox), it doesn't make use of any of the M-mode CSRs, but instead uses S-mode CSR counterparts (satp, scause, sepc, sie, sip, sscratch, sstatus, stval and stvec).
This is because normally, M-mode (being the highest privilege level) is reserved for system firmware.
Linux (uClibc + busybox) nommu on the other hand uses the same CSRs as FreeRTOS, with some additional ones (marchid, mimpid, mip, mscratch, mtval, mvendorid, pmpaddr0 and pmpcfg0).

It is important to note that for the RTOSs, the M extension is optional and only inserted, if a compiler that targets the M extension is used.
For the sake of our analysis, we have enabled the M extension.

\subsection{Eliminating Requirements}

These requirements are samples of what needs to be implemented with currently available software.
In theory, many of these requirements can be eliminated by patching these software tools and the operating systems themselves to run without them.

Doing so necessitates a considerable amount of changes to be made just to support the modified operating systems.
Any programs that are intended to be run on the deployed operating system could require such changes as well.
Whether it is more worthwhile to implement these extensions instead of intervening to make such an effort unnecessary by software adaptations is an interesting question.

For some of these extensions and operating systems, like the A extension for embedded Linux, eliminating their requirement is uneconomical.
In the case of the A extension for embedded Linux: The distributions were generated using buildroot, which compiles a custom toolchain that is then used to cross-compile the Linux kernel and busybox.
This entire process makes certain assumptions during compile time, where support for the A extension is one of them.
Thus, to eliminate the A extension requirement for embedded Linux, which might not by itself be hard to meet, one would have to adapt this process not to assume the A extension.
As an alternative, correct execution of instructions introduced by the A extension could be emulated by trapping on illegal instruction exceptions and then decoding and executing them in software.
But the extent of this effort is difficult to gauge and dependencies that might surface are not at first transparent.

\pagebreak

\section{Conclusion}
\label{section:overview-conclusion}

To provide a compact overview, in the following section the main insights with respect to RISC-V ISA extensions presented in this chapter are summarized.

\subsection{Unprivileged ISA Extensions}

\textbf{M Extension   }
While support for the M extension does increase performance, build tools do not necessarily explicitly require M extension support.
%As we will see, support for the M extension is not too big of a task, so it is up to us to decide whether to equip the MiRiV processor with M extension support.

\textbf{A Extension   }
For RTOSs and our simple two-task UART application, A extension support is not necessary.
For any Linux based operating system, such as both embedded Linux configurations we have looked at, this extension is strictly necessary.

\textbf{Zifencei Extension   }
The conclusion we have reached for the A extension also applies here.

\textbf{Zicsr Extension   }
All operating systems we have looked at require Zicsr.
This is not too surprising considering that besides being vital for managing the state and correct execution of each core, it is strictly necessary to support any kind of exception, interrupt and trap handling.
Although it is possible to roll one's own system to support interrupts for example, using the already defined CSRs provides a fairly standardized way for hardware and software to interact with the common goal of managing exceptions, interrupts and traps.
But CSRs that serve quite different purposes exist too.

\textbf{Counters Extension   }
This extension is only explicitly used by the embedded Linux (glibc) configuration.

\textbf{F, D, Q Extensions   }
For some applications, floating-point support is vital.
For others, it might not be necessary (we marked it as expendable in our OS analysis) and going without it saves many hardware resources and implementation efforts.
Operating systems also support processors that don't have floating-point support, so the implementation of these three extensions is optional.

\textbf{C Extension}
While it can offer a respectable improvement in code sizes and even performance (by virtue of relieving the instruction cache slightly), it is optional and implementing it takes effort.
There are not only many instruction variations added, but because 16-bit and 32-bit instructions are intermixed, instructions can now start on any 16-bit boundary, complicating the fetch and decode stages.

\subsection{Privileged ISA Extensions}

\textbf{M-mode   }
The most basic execution environment for all operating systems includes at least M-mode instructions.
As such, implementing M-mode is unavoidable.
There is not too much additional specific M-mode functionality to be implemented to support it, because there is an overlap with other unprivileged ISA extensions, such as Zicsr for example.

\textbf{S-mode   }
For standard Linux systems, such as the embedded Linux (glibc) configuration we have looked at, S-mode is a requirement.
This is because an execution environment with separated machine and supervisor levels is assumed.
For the implementation of embedded Linux (glibc) supporting at least a subset of the S-mode extension is unavoidable.

\textbf{H-mode   }
If we were to support operating systems with even more powerful features, systems that make use of a hypervisor for example, we would need to implement the H-mode extension.
Because it is a complex extension with many additions, while at the same time not being used at all by considered operating systems, we will not take a closer look at this extension.

\subsection{The Next MiRiV Processor}

Out of all possibilities of our preliminary choices and our analysis thus far, it is clear that supporting the execution of an operating system necessitates the implementation of a system for trap handling, at the very least.
It is also clear that embedded Linux has strictly greater set of requirements than the RTOSs in consideration, making it imperative to support RTOSs first.
Thus it was determined that the first step would be to support an RTOS.
Out of FreeRTOS and Zephyr RTOS, FreeRTOS was chosen for its simplicity, maturity, unopinionated build system, RISC-V support and clear implementation path.
%These reasons have already been outlined but are reiterated here as a reminder.

To support FreeRTOS, we will need to implement the Zicsr extension and a system for trap handling, at the privilege level of M-mode.
It was deemed appropriate to also include the M extension and a branch predictor, because this effort could be completed in a concerted effort with a colleague for whom it is also beneficial to include these extensions.

The effort required to go from an implementation that supports FreeRTOS to an implementation that supports embedded Linux is not entirely predictable, but was found to be high (due to necessary architectural additions and/or porting effort) and beyond the scope of this work.

% CHAPTER
\chapter{Architectural Considerations}
\label{chapter:architecture}

In Chapter~\ref{chapter:overview} we have examined available RISC-V ISA extensions, different operating systems and their precise requirements.
%In this chapter we want to take a quick tour of how an extension might be implemented and what costs and benefits such an implementation would entail.
In this chapter we want to take a quick tour of what properties an extension implementation needs to fulfill and what costs and benefits its implementation would entail, by examining which components, registers and instructions need to be added.
We want to constrain ourselves only to extensions that have proven to be strictly necessary for RTOSs and those that are used in the simpler of the two embedded Linux cases (using uClibc).

In addition, we will consider the M extensions and a non-ISA related microarchitectural extension that is (for our purposes) completely separate of any operating system and can serve to greatly improve system performance, namely branch prediction.
These two extensions, even though they are considered performance enhancing, are included because an implementation for them was developed in conjunction with a colleague.

These considerations are not necessarily related to the final implementation, which is further outlined in more detail in the next chapter.
Rather, this chapter serves to show briefly which design decisions can be made and what their respective trade-offs are.

\section{Exception, Interrupt and Trap Architecture}
\label{section:exc-int-trap-arch}

An operating system managing different tasks, such as external events triggered by peripherals, let alone its own events if they can occur simultaneously, assumes interrupts and exceptions to be supported.
Almost no modern processor runs without some kind of interrupt architecture.
The RISC-V specification defines the term \textit{exception} to refer to events caused internally by usual or unusual conditions associated with an instruction being executed, \textit{interrupt} to refer to externally caused events and \textit{trap} to refer to the transfer of control to a trap handler caused by either of the two \cite{riscv-isa-vol-1}.
They might be differentiated by categorizing exceptions as synchronous and interrupts as asynchronous, as the former is associated with the well-defined execution of a hardware thread (hart), while the latter can be triggered at any time externally.

There are different interrupt controller designs to support exceptions, interrupts and traps.
Many designs are possible, but we want to restrict ourselves to designs that are relatively standardized and thus predictable for operating system developers.
SiFive has thankfully already tackled this problem and produced a cookbook that lists three different architectures \cite{sifive-interrupt-cookbook}.
We will consider their interrupt controller designs first.

The most basic design is the Core Local Interruptor (CLINT), which has a fixed priority scheme and implements software, timer and external interrupts.

Then there is the Core Local Interrupt Controller (CLIC), which sports a more flexible configuration compared to the CLINT, at the cost of being a more complex design.
Advantages are reverse compatibility and programmable interrupt levels and priorities, among others.
Both CLIC and CLINT, as the name implies, are core local units and thus meant for single hart systems.

Going beyond this constraint is the Platform Local Interrupt Controller (PLIC), which is a more advanced design that may dispatch interrupts to more than one hart.
This makes it a global interrupt controller.
It has the highest amount of complexity, but also the highest amount of flexibility.

Because the CLINT is the best fit for our application, we will consider it in more detail.

\subsubsection{Core Local Interruptor (CLINT)}

The CLINT can be best understood by examining what CSRs it depends on and what functionality these CSRs are meant to add to the processor as a whole.
To start off, there are interrupt related CSRs (\texttt{mstatus}, \texttt{mcause}, \texttt{mie}, \texttt{mip}, \texttt{mtvec}, \texttt{mtvt}, \texttt{msip}).
Early in the boot flow, the CSR \texttt{mtvec} (M-mode trap vector) needs to be set up, in particular for exception handling in case any early errors occur, because it determines where the hart will jump to in order to handle the trap.
Its first two bits are set up to contain the mode \texttt{mtvec.mode = mtvec[1:0]}, while the higher order bits contain the trap vector base-address \texttt{mtvec.base = mtvec[31:2]}.
The two modes currently specified are the direct mode and vectored mode.
This can also be seen in Figure~\ref{fig:mtvec-register}.

\begin{figure}[h]
    \centering
    \includegraphics[width=\textwidth]{mtvec-register.png}
    \caption{Register layout for \texttt{mtvec} as can be found in \cite[Fig.~3.9]{riscv-isa-vol-2}.}
    \label{fig:mtvec-register}
\end{figure}

Direct mode (\texttt{mtvec.mode = 0}) traps all exceptions and interrupts to the same handler with no vector table, thus incurring an additional software cost to determine where to continue execution.
Vectored mode (\texttt{mtvec.mode = 1}) introduces a vector table that stores addresses of the trap handlers, which lowers latency.
The precise mechanism to address the correct handler is to use the base address \texttt{mtvec.base} stored in CSR \texttt{mtvec} with an added offset stored in CSR \texttt{mcause} (multiplied by a factor of four).

Each time an exception or interrupt occurs, the CSR \texttt{mcause} will be populated according to what has caused the transfer of control flow.
The highest bit denotes whether the source is an interrupt or an exception, while the lower bits contain the code associated with respective causes.

For direct mode, when jumping to \texttt{mtvec.base} the hart encounters a software defined method that determines what to do next.
In vectored mode, there is a contiguous memory region of jump instructions for all the different interrupt handlers at \texttt{mtvec.base} and the hart jumps to \texttt{mtvec.base} $+$ (\texttt{mcause.code} $\times 4$).
The jump instruction that lies at each interrupt handler offset stores the address of the actual interrupt handler.

\section{M Extension}
\label{section:m-arch}

The M extension adds multiplications and divisions.
To support this, a multiplication and division (muldiv) unit could be implemented in conjunction with additions to the decode pipeline stage to decode the new instructions.
All instructions use the R-type instruction format and have two source registers and a destination register, together with an opcode and a funct3 section that specifies the operation.
Figure~\ref{fig:m-ext-instructions} shows their layout.
For multiplications, the sources can be interpreted either as signed or unsigned integers.

\begin{figure}[h]
    \centering
    \includegraphics[width=\textwidth]{m-ext-instructions.png}
    \caption{M extension multiplication, division and remainder instruction layout as can be found in \cite[Ch.~7]{riscv-isa-vol-1}.}
    \label{fig:m-ext-instructions}
\end{figure}

Multiplication and division can be costly in terms of hardware resources, but division more so in particular.
Because of this, an extension called Zmmul is in the process of being specified which adopts only multiplication instructions of the M extension.

These instructions are analogous to other arithmetic instructions, so the muldiv unit can be incorporated into the execute stage right alongside the ALU.
Whereas the ALU completes in a single cycle however, it is to be expected that muldiv will take multiple cycles.
This brings a requirement for the control unit to handle subsequent stages of the pipeline while waiting for muldiv to complete its operation.

\section{A Extension}
\label{section:a-arch}

As we have seen, for this extension we would need to implement Load-Reserved/Store-Conditional (\texttt{LR.W} and \texttt{SC.W}) and atomic memory operation (AMO) instructions (\texttt{AMO(SWAP|ADD|AND|OR|XOR|MAX[U]|MIN[U]).W}), all of which use the R-type instruction format, but with the highest seven bits allocated as \textit{funct5}, \textit{aq} (acquire), \textit{rl} (release) fields.
This is further illustrated in Figure~\ref{fig:a-ext-instructions}.
For multi-core systems, implementing this extension is not at all trivial.
With our single-core system, atomics are easier to support however.

\begin{figure}[h]
    \centering
    \includegraphics[width=\textwidth]{a-ext-instructions.png}
    \caption{A extension Load-Reserved/Store-Conditional and atomic memory operation instruction layout as can be found in \cite[Ch.~8]{riscv-isa-vol-1}.}
    \label{fig:a-ext-instructions}
\end{figure}

These instructions serve not only as synchronization mechanisms, but they keep data consistent from irregular control flow.
If interrupts are implemented, special care needs to be taken that interrupts do not interfere with a sequence of instructions that are supposed to be atomic.
For single instruction operations, there is no problem, because they are inherently atomic in our single-core system.

Store-Conditional instructions use a return code to specify operation success or failure, used to retry on failure.
Using a Load-Reserved/Store-Conditional sequence is specified to eventually succeed, under certain pre-existing conditions.
For this reason inside a Load-Reserved/Store-Conditional sequence certain instructions are not allowed, the most recent Load-Reserved and Store-Conditional addresses must align and have the same data size, and other additional conditions.
This category of forbidden instructions also includes \texttt{JALR}, \texttt{FENCE} and \texttt{SYSTEM} (\texttt{(ECALL|EBREAK)}) instructions, loads, stores, backward jumps, taken backward branches, but also those of all other extensions (except for compressed versions of allowed instructions) \cite{riscv-isa-vol-1}.
For an implementation of this extension, we would thus need to ensure that under these conditions there is in fact eventual success of an appropriate Store-Conditional instruction.

The AMO instructions are not so simple:
"These AMO instructions atomically load a data value from the address in \textit{rs1}, place the value into register \textit{rd}, apply a binary operator to the loaded value and the original value in \textit{rs2}, then store the result back to the address in \textit{rs1}." \cite{riscv-isa-vol-1}
Because of these complicated semantics, they are not a small addition even to MiRiV (5-stage pipelined single-core implementation).
If we wanted to re-use all existing pipeline stages and their components we would run into a problem:
The value that is loaded would be read only in the memory stage and written back in the write-back stage, but the loaded value needs to be available already in the execute stage (where the ALU that would already support addition and binary logic resides) to apply the binary operator.

One way to implement these semantics is to disable interrupts and emit a sequence of base integer instruction set instructions that emulates the behavior of these atomic instructions.
Another would be to support all binary operators in the memory stage by immediately using the read value in combinatorial logic, possibly extending the critical path and bringing overall performance down.
Finally, for systems supporting trap handling an uninterruptible trap could provide an implementation for all the AMO instructions.

\section{Zifencei Extension}
\label{section:zifencei-arch}

The Zifencei extension serves as a mechanism of synchronizing data and instruction streams, typically because instruction memory needs to be reloaded because it was pre-fetched but modified in the meantime.
A simple pipelined implementation could ensure that it reads the updated value by flushing the pipeline and forcing the fetch stage to load all upcoming instructions again.
In our pipeline, another option would be to emulate the behavior of a branch, jumping to the following instructions unconditionally on an instruction fence and causing a pipeline flush.
The pipeline flush would be caused after all writes have been completed.

However, in our pipeline the instruction and data memories are separate and the instruction memory is not writable by the pipeline.
In order to comply with the Zifencei extension, we would first need to enable instruction memory writes, a feature that is not required anyway.
%This actually makes any compliance with the Zifencei extension impossible, without first enabling instruction memory writes.

\section{Zicsr Extension}
\label{section:zicsr-arch}

We have already outlined most characteristics of the Zicsr extension in Section~\ref{subsection:overview-zicsr-ext}.

To support operating on at least some CSRs using software, they are specified as memory-mapped registers to be operated upon using special instructions.
In the specification, entire address ranges are allocated, but not all addresses are populated yet.
Some CSRs are set up to store and set system information and they can do so using individual bits of the CSR register.
For example, the \texttt{mstatus} register is set up to contain 17 different fields (and four more reserved for future use), the layout of which is illustrated in Figure~\ref{fig:mstatus-register}.
Individual bits can be read-only.
Besides special use CSR addresses with their CSRs having their own unique semantics, there are also address ranges reserved for custom use.

\begin{figure}[h]
    \centering
    \includegraphics[width=\textwidth]{mstatus-register.png}
    \caption{Register layout for \texttt{mstatus} as can be found in \cite[Fig.~3.6]{riscv-isa-vol-2}.}
    \label{fig:mstatus-register}
\end{figure}

The entirety of all 4096 CSRs could be implemented as a RAM, but this has several practical disadvantages.
As we have established in Section~\ref{subsection:os-analysis-results}, only few CSRs are used in practice.
The semantics of individual CSRs can also differ, with some CSRs having individual read-only bits and some CSRs being exclusively managed by hardware components, further complicating the implementation as a RAM.

A more appropriate solution is to multiplex register access using the address to denote the register while implementing and thus supporting only a small selection of CSRs.

Zicsr can be implemented by adding a CSR unit that contains all supported CSRs, takes control signals such as the CSR operation (if any), an address and a register or immediate value.
Because this information either has to be decoded on the fly or passed on by the decode stage of the pipeline, the CSR unit can either operate in the decode or execute stage.
Data that is read back from the CSRs, if used, needs to be forwarded in case of data dependencies or the pipeline has to be stalled.
If the CSR unit is included in the execute stage, it can function like an alternative output data source alongside the ALU (and possibly a multiplication and division unit).

When getting a request for data to be written and read, the CSR unit can do so in a single-cycle.
Either the data supplied in the request is written, or it is used to mask (setting or resetting bits) previously held bits, in the next cycle, while at the same time data is read.

\section{Branch Prediction}
\label{section:bp-arch}

There already exists a primitive form of branch prediction in the current MiRiV pipeline implementation.
We want to consider branch prediction to see whether there are any significant performance gains to be had by way of relatively simple improvements, although we will not provide a detailed analysis of relative performance improvements.

\subsection{Pre-existing Predictor}

The MiRiV processor is designed not to stall whenever branch decisions come up, continuing execution on the basis of a branch prediction and determining the actual result of the branch instruction only later.
Complications like branch target calculation and the like can be avoided by following a simple strategy:
always assume there will be no change in control flow and continue execution.
This is akin to always predicting that a branch will not be taken, making the current implementation a static branch predictor.
If the result of the branch instruction later turns out to be a taken branch, then incorrectly fetched instructions within the pipeline are flushed.
That is to say, all instructions in the pipeline that were queued as a result of an incorrect static prediction are discarded.

The result of the branch instruction is determined in the fourth stage, the memory stage.
Currently a VHDL type containing the operations \texttt{BR\_NOP} (no branch instruction), \texttt{BR\_BR} (jump instructions), \texttt{BR\_CND} and \texttt{BR\_CNDI} (conditional and conditional inverted branch instructions) is passed down up to the memory stage, where the new branch target is determined and a signal to the fetch stage is asserted if the PC is to be taken from the memory stage, causing a flush of the pipeline.
Even for unconditional branches (jumps), the pipeline is only flushed later.
The conditional branches are determined using the zero flag of the ALU passed to the memory stage.

\subsection{Simple 2-bit Predictor}

To improve the prediction, a simple 2-bit predictor unit can be implemented.
A 2-bit predictor keeps a history of the last two branch results and predicts the next branch result based on that history.
There are many ways to implement such a unit, it can for example be placed inside the fetch stage, where it produces the next PCs while accepting prediction results from the memory stage.

Having the branch predictor as part of the fetch stage has the disadvantage of requiring additional hardware to decode incoming instructions.
If the branch predictor were part of a later stage it could make use of the computation in the decode stage, but placing the branch predictor at any later stage in the pipeline inevitably either causes delays or requires static predictions like those of the pre-existing predictor.
The price of additional hardware to decode incoming instructions for the branch predictor in the fetch stage is a small price to pay, although if these additions lie on the critical path the maximum operating frequency could be lowered as a consequence.

\begin{figure}[H]
    \centering
    \includegraphics[width=.5\textwidth]{2-bit-predictor.pdf}
    \caption{2-bit prediction state machine, highlighting the four possible states.}
    \label{fig:2-bit-predictor}
\end{figure}

To realize the 2-bit prediction, a state machine (as can be seen in Figure~\ref{fig:2-bit-predictor}) with four states can be implemented: two taken and not taken states, with weak and strong prediction respectively.
This state machine exhibits a kind of hysteresis, where two mispredictions are required for the branch predictor to change its prediction.
The branch predictor uses its current prediction to determine the PC for the fetch stage, but in case of mispredictions the memory stage has to issue a new PC and flush the pipeline.

Another aspect that can be improved upon compared to the pre-existing predictor is the prediction result of jump instructions.
When the branch predictor recognizes a jump instruction (\texttt{JAL}), it can immediately return the correct PC.
In case of jump register instructions (\texttt{JALR}), no such improvement can be made, because information of later pipeline stages is required.
To deal with this, the memory stage can still issue a new PC and flush the pipeline.

% CHAPTER
\chapter{Implementation}
\label{chapter:implementation}

We now want to discuss the details of our implementation.
This will be done by illustrating all the design decisions for the extensions of the MiRiV processor and by giving an overview of how FreeRTOS was included as an application to the pre-existing suite of software test applications.

\section{Extending MiRiV}
\label{section:preparing-miriv}

With all aforementioned changes made to the original MiRiV pipeline (Figure~\ref{fig:miriv-pipeline}), we end up with our new pipeline as shown in Figure~\ref{fig:miriv-new-pipeline}.
This again shows all the individual components of the MiRiV pipeline implementation.
Components newly added to the pipeline have been highlighted in light red color.

\begin{figure}[h]
    \centering
    \includegraphics[width=\textwidth]{miriv-new-pipeline.pdf}
    \caption{The modified MiRiV pipeline, with new extensions highlighted in light red. For the original MiRiV pipeline refer to Figure~\ref{fig:miriv-pipeline}.}
    \label{fig:miriv-new-pipeline}
\end{figure}

The most important extension to the pipeline is the exception, interrupt and trap architecture, in conjunction with the Zicsr extension.
For the exception, interrupt and trap architecture, we have already discussed design options that were made available by SiFive, with a particular focus on the core local interruptor (CLINT) \cite{sifive-interrupt-cookbook}.
The modified MiRiV pipeline makes use of the semantics and specifications of the CLINT and integrates its use with an implementation of the Zicsr extension in the form of a CSR unit.
Further, an auxiliary trap unit was added to the pipeline to facilitate correct changes in control flow and correct interoperation with the new CSR unit.
We will look at these architectural additions and changes in more detail now.

Because the CLINT, CSR and trap units all work in conjunction to enable more sophisticated control flows, their implementation is documented together in Section~\ref{section:exc-int-trap-implementation}.

Afterwards, we will consider implementation details for the other two pipeline extensions, the muldiv unit implementing the M extension in Section~\ref{section:m-ext-implementation} and the modified branch predictor in Section~\ref{section:bp-implementation}.

\section{Implementation of Exceptions, Interrupts and Traps}
\label{section:exc-int-trap-implementation}

To keep the implementation simple and avoid handling of relatively complex edge cases, interrupts and exceptions are designed to function in a way similar to normal instructions.
Normal instructions in the MiRiV pipeline are straightforward: after being loaded in the fetch stage, they flow through all the other stages of the pipeline, being handled by functional units in each stage in progressive fashion accordingly to fulfill their function.
This approach has the benefit of simplifying design by avoiding complex inter-dependencies, as each stage simply pushes their data forward to the next.
Some orchestration by a control unit is necessary, but the responsibilities of the control unit are kept small.

This leads to a natural implementation of interrupts and exceptions.
Software interrupts are caused by an instruction, so they already flow through the pipeline.
External interrupts can be seen as events that are signaled to the processor from some outside unit.
Even though this is the case, for our implementation they can still be analogous to software interrupts, by having the fetch stage receive these interrupt signals and passing them on like instruction data.
Exceptions on the other hand are entirely internal events that appear only mid-execution.
In the case of our implementation this refers to the decode stage, where an illegal instruction can make decoding impossible, and the memory stage, where a misaligned load and store can occur.
In the case of the two special software exceptions, those raised by instructions \texttt{ECALL} and \texttt{EBREAK}, they are detected in the decode stage and flow through the pipeline like any other instruction.

Each of these events requires treatment via a jump to an interrupt or exception handler, a so-called trap.
Specifically for the MiRiV pipeline, this irregular control flow is similar to the process of correcting a wrongly predicted branch result.
Only in later stages of the pipeline can it be ascertained that the branch was correct, which means that the instruction is pushed all the way to the memory stage until the processor can intervene to correct a possible misprediction.
This is accomplished by having the control unit flush respective pipeline stages, while the fetch stage assumes the corrected program counter signaled back to the fetch stage from the memory stage.

This already pre-existing mechanism to update the program counter in the case of a (perhaps not so) irregular event can be expanded to handle interrupts and exceptions as well.
Interrupts and exceptions can be passed down in the pipeline from stage to stage, until they reach the memory stage, where the irregular control flow event is then signaled to the control unit and the fetch stage, making interrupts and exceptions appear like any other misprediction.

Another reason that makes this implementation natural is the critical role of the CSR unit.
The CSR unit holds and controls all CSRs, which are essential parts of interrupt and exception handling.
We will describe all relevant CSRs and their role in this sort of irregular control flow before long.
The CSR unit does not only keep track of processor state, but it exposes processor state in the form of memory-mapped registers that can only be interacted with using special CSR instructions.
%This means it is necessary in a pipelined implementation to consider forward data dependencies for this software exposed side of the CSR unit.
It is not sufficient to track and manage processor state in a unit that is decoupled from individual pipeline stages.

Arguably, to uphold this principle of having only forward data dependencies, it is a good idea to place functional units as early in the pipeline as possible, but not earlier.
For example, if a functional unit required decoding the instruction first, then if the functional unit were placed in the stage that follows the decode stage, it can directly make use of the results of the decode stage.
In this example, the functional unit could still be placed earlier in the pipeline, but then it might require duplication of decoding hardware.
This is one reason to place the CSR unit in the execute stage of the MiRiV pipeline, which can then neatly make use of data prepared in the decode stage.
Upon handling a CSR instruction, none of the other functional units of the execute stage are busy and if any data is read via a CSR instruction, the output of the CSR unit is only a single 32-bit value, just like with the ALU, or the muldiv unit.

If interrupts and exceptions are handled only in the memory stage, like previously discussed, this placement of the CSR unit in the execute stage has another elegant property:
data that is required to decide whether to trap and where to trap to can be passed on forward by the CSR unit in the execute stage to the memory stage.

\subsection{Control and Status Registers (CSR Unit)}
\label{subsection:csr-unit}

The CSR unit was implemented to fulfill two functions:
read/write CSRs and pass on processor state information to the trap unit.

Unlike other functional units that don't require a clock, reset and stall signal, the CSR unit does, because it also contains the CSRs themselves.

To fulfill the first function, the CSR unit has been designed to take \texttt{op}, \texttt{csraddr}, \texttt{csrread}, \texttt{readdata}, \texttt{csrwrite} and \texttt{writedata} signals.
The operation is specified by \texttt{op} and the CSR is selected by \texttt{csraddr}.
In the case of a read, \texttt{csrread} is set to high by the decode stage and \texttt{readdata} is filled with the value contained in the selected CSR.
In the case of a write, \texttt{csrwrite} is set to high by the decode stage and \texttt{writedata} is populated either with a register or an immediate value.

Towards its second purpose, the CSR unit takes a \texttt{trapop} signal that contains all relevant information for the trap unit, populates it with CSR values and passes it on to the memory stage where the trap unit resides.

Now we will get into a description of CSRs that are necessary and how they are used.

\subsubsection{mstatus}

This CSR serves to store general processor state that is particularly relevant to correct handling of nested traps and correct execution in the presence of multiple privilege modes.
There is only one bit that is currently relevant to the MiRiV processor, the \texttt{mstatus.MIE} bit, which denotes whether or not interrupts are enabled globally (targeting only interrupts, not any of the exceptions).
This bit can be interacted with programmatically, but the pipeline also sets and clears it when certain events take place.
Notably, to disable nested interrupts (interrupts inside traps), this bit is set during a trap and cleared by an \texttt{MRET} instruction.
This instruction signifies that the trap handler has finished and normal execution can be resumed.
A short discussion on nested traps in general and why it is okay for our purposes to disable them is provided in the discussion of the trap unit in the next section.

Nested interrupts are already disabled by a separate mechanism that we will discuss soon, but it is still important to toggle this bit during a trap to expose this state information to software.

\subsubsection{mie}

This CSR serves to store whether interrupts are enabled, for each individual interrupt type.
Do note that, like the related CSR \texttt{mip}, this CSR is only concerned with interrupts, not any exceptions.
In general this CSR can toggle different interrupt types for different privilege modes, but in our case only M-mode interrupts are relevant.
There are only three bits relevant to our implementation, \texttt{mie.MEIE} (external interrupts enabled), \texttt{mie.MTIE} (timer interrupts enabled) and \texttt{mie.MSIE} (software interrupts enabled).

\subsubsection{mip}

This CSR is functionally equivalent to CSR \texttt{mie}, except for one difference: instead of storing whether an individual interrupt type is enabled, it stores whether an individual interrupt type is pending.
Pending interrupts are those that occur and can not immediately be handled.
One such instance would be an interrupt type that is currently disabled.
Information about pending interrupts is always stored, even when they are disabled.

When execution of pending interrupts becomes possible again and no conditions prevent a pending interrupt from taking place, the value of this CSR will cause an interrupt when passed to the trap unit similar to how an interrupt would normally be caused and additionally clear the pending bit.
In the case of multiple pending interrupts, we could either use a simple prioritization scheme that always chooses certain interrupts over others, or we could use some kind of arbitration.
Interesting arbitration schemes that could serve to prevent starvation would be a round-robin scheme, or a scheme that keeps a simple counter for each pending interrupt type and chooses those that were chosen least.
In the case of a prioritization scheme, the priorities could also be made configurable.
For now, a simple unconfigurable prioritization scheme has been chosen for the MiRiV processor with the following order of descending priority:
software interrupts, external interrupts and timer interrupts.
This ordering has been chosen, to prioritize program flow and instructions placed for debugging purposes over external requests that can be delayed without issue as well as timer interrupts that are expected to happen more frequently.
For a different system or application a different ordering might be desirable.

Like with the CSR \texttt{mie}, only the following bits are relevant: \texttt{mip.MEIP} (external interrupt pending), \texttt{mip.MTIP} (timer interrupt pending) and \texttt{mip.MSIP} (software interrupt pending).

\subsubsection{mtvec}

This CSR stores either the direct address of a generic trap handler, or the base address of the trap vector.
The lowest two bits (which for an address would be unavailable anyway to ensure correct alignment) denote how this CSR is set up.
Setting the mode to direct mode ($\text{MODE} = 0$) sets all interrupts and exceptions to jump directly to the stored address.
In vectored mode ($\text{MODE} = 1$), all interrupts and exceptions are set to jump to the base address with some offset denoted by the CSR \texttt{mcause} ($\text{BASE} + 4 \times \text{cause}$).

The CSR can contain a read-only value, but it can also be writable.
In general it is exposed to software at least for reads.
In a real context this CSR can be expected not to change too often, but because it can change if it is writable, its value has to be loaded and passed on.
Because the CSR unit is located in the execute stage, its value can be loaded and passed on to the memory stage, where information about the correct jump target is necessary.

\subsubsection{mepc}

This CSR is used to keep track of where execution was stopped before trapping.
One advantage of the design choice to have all exceptions and interrupts flow to the memory stage, is that the value of this CSR can easily be determined.
Because exceptions and interrupts, when they do take place and cause a change in control-flow and flush of the pipeline, accompany an instruction up until the memory stage, these exceptions and interrupts can be said to hijack the execution of the instruction they accompany.
The program counter that needs to be stored in the CSR \texttt{mepc} is always the instruction that is currently in the memory stage.
It will be necessary however to make sure that the correct program counter is retained and usable if a trap were to be caused during a pipeline flush.
This could be easily circumvented by delaying traps, such that a flush can never have an adverse effect on which program counter is chosen.
Upon encountering an \texttt{MRET} instruction, the value stored in this CSR is used to jump back to where normal execution was interrupted.

\subsubsection{mcause}

This CSR has an elementary purpose, to keep track of what has caused a change in control-flow.
When CSR \texttt{mtvec} is in direct mode, the cause can be used by software to determine the appropriate course of action.
In vectored mode, it is also used to directly compute the correct jump target.
When multiple interrupts or exceptions occur, it is set to the interrupt or exception of the highest priority.
Exceptions are prioritized over interrupts.
For multiple simultaneous exceptions, the RISC-V privileged specification specifies how they should be prioritized.
In the case of MiRiV, because exceptions have to be carried all the way to the memory stage by the originating instruction, as long as there are no instructions that can cause exceptions in multiple pipeline stages, there are no simultaneously occurring exceptions.

\begin{table}[H]
    \centering
    \begin{tabular}{r|r|r|l}
        Interrupt & Exception Code & Value & Description \\ 
        \hline
        0 & 2 & 2 & Illegal instruction \\ 
        0 & 3 & 3 & Breakpoint (\texttt{EBREAK}) \\ 
        0 & 4 & 4 & Load misaligned \\ 
        0 & 6 & 6 & Store misaligned \\ 
        0 & 11 & 11 & Environment call from M-mode (\texttt{ECALL}) \\ 
        1 & 3 & 2147483651 & Machine software interrupt \\ 
        1 & 11 & 2147483659 & Machine external interrupt \\ 
        1 & 7 & 2147483655 & Machine timer interrupt \\ 
    \end{tabular}
    \caption{\label{table:exc-int-listing}Listing of implemented exception and interrupt causes and their \texttt{mcause} values (adapted from Table~3.6 of the RISC-V privileged specification \cite{riscv-isa-vol-2}). Priorities are fixed and correspond to the order of occurrence in the table. The interrupt field corresponds to the unsigned value of \texttt{mcause[31]} and the exception code field to the unsigned value of \texttt{mcause[30:0]}, while the value field shows the unsigned value of the entire register.}
    \vspace{1ex}
\end{table}

Table~\ref{table:exc-int-listing} shows which exceptions and interrupts are implemented in the modified MiRiV pipeline, what their \texttt{mcause} values are and in which order they are prioritized.

Do note that the CSR \texttt{misa} is also implemented but not relevant to interrupt and exception handling.
Also note that exceptions that happen for instructions which are executed as part of a mispredicted branch get flushed as well when the branch is corrected, which is necessary to avoid handling exceptions for instructions that were not meant to be executed.
When a trap is executed and earlier stages are flushed, their exception vectors are also flushed.
Interrupts on the other hand do not get flushed, but in the case that a trap is taken for another exception/interrupt, the pending bit is set to delay handling of the interrupt until appropriate.

\subsection{Trap Generation (Trap Unit)}
\label{subsection:trap-unit}

Like previously mentioned, the memory stage determines whether a branch prediction was correct and if it wasn't, the misprediction is signaled to the control unit and the corrected jump target is transmitted to the fetch stage, causing a pipeline flush and change in control flow.
This mechanism can be re-used for traps, by making the memory stage change the control flow in the additional case of a trap (or a return from a trap).
For this purpose, a new functional unit has been added that has the purpose of deciding whether to trap, calculating the correct jump target and writing back to the CSRs.

The trap unit takes as inputs an enable signal, the program counter of the memory stage and a trap operation data structure \texttt{trapop}, which is a record containing all information necessary to determine whether or not to raise a trap, calculate the correct jump target and new values for the CSRs.
Inside a \texttt{trapop} are current values for CSRs \texttt{mstatus}, \texttt{mie}, \texttt{mip} and \texttt{mtvec} supplied by the CSR unit in the execute stage, a signal for \texttt{MRET} and two signal vectors for accumulated exceptions and interrupts that have passed through the pipeline.
These vectors accumulate exceptions and interrupts, because they are simply passed on through individual pipeline stages, with respective bits going high if an interrupt or exception is issued in any one of the pipeline stages.
This approach is essential for exceptions, because it keeps them consistent in the order in which they appear, so that no instruction that causes an exception in an earlier pipeline stage can raise an exception before its predecessing instructions that cause one only in a later stage.
A detailed explanation of this approach can also be found in \cite[Appendix C., Ch. 4]{comparch2019}.

Out of this, the trap unit determines the correct course of action and generates the aforementioned information.
The memory stage uses this information to update the control flow if necessary and writes back new values for the CSRs to the CSR unit, similar to how writes to the register file are handled.
Just like writes to the register file, if necessary the updated CSR values can be forwarded.

The trap unit is kept stateless.
This has the advantage of simplifying implementation and verification of the unit.
To ensure correctness, additional registers \texttt{cooldown} and \texttt{trapped} were added to the memory stage instead.
How they are used will be explained shortly.
The only relevant information from the perspective of the trap unit is whether or not it would be acceptable to generate a trap.
In the implementation of the MiRiV processor, there are two cases where trap generation needs to be delayed:
traps during pipeline flushes and during execution of other traps.

The first case is problematic because it complicates calculation of the correct program counter for the CSR \texttt{mepc} in the case of exceptions and interrupts.
As for the second case, raising any trap during execution of any other trap is explicitly forbidden, due to possible state inconsistencies.
Except for the case of exceptions raised during the execution of an interrupt handler, nested traps are not as useful, as we can see by looking at possible cases:
Interrupts are disabled during other interrupts regardless, because all implemented interrupts are part of the same privilege mode.
During traps caused by exceptions, it is acceptable to simply delay interrupts.
In all other cases, i.e. exceptions within an exception/interrupt handler, the exception would prove fatal if not handled specifically by the application, because the correct control flow information, such as the value of the CSR \texttt{mepc}, are lost.
Because the target application does not handle these special cases, nested traps are unsupported.

The \texttt{cooldown} and \texttt{trapped} registers are used to delay the generation of traps by disabling the trap unit for three clock cycles after a pipeline flush (representing pipeline bubbles) and while in a trapped state respectively.
The processor is in a trapped state after a trap has been raised and until an \texttt{MRET} instruction has reached the memory stage.
A cooldown after pipeline flushes ensures that no trap with an invalid program counter value can be generated, as well as no exceptions caused by invalid instructions from regions of memory that are not supposed to be executed.
This cooldown mechanism could be replaced by keeping track of the last valid program counter value for CSR \texttt{mepc} updates, as well as annotating instruction packets with a valid field to disregard exceptions of invalid instructions.
Instead, this simpler mechanism was chosen for simplicity, as the worst case scenario would only delay interrupts by three clock cycles.

Do note that because it is not needed for our use case, the \texttt{WFI} (wait for interrupt) instruction is implemented as a \texttt{NOP}.

\subsection{Core Local Interruptor (CLINT Unit)}
\label{subsection:clint-unit}

The final component of this interrupt and exception scheme is the interruptor itself.
The interruptor is a unit that collects information from outside the processor and issues interrupt service requests.

Software interrupts are issued when writing an appropriate value to the CSR \texttt{mip} and as such, software interrupt requests are already contained within the MiRiV processor with the addition of the CSR and trap units.
But the core itself does not generate any timer interrupt requests and external interrupt requests.

These other interrupt requests can only be signaled from outside the processor core.
There are several considerations as to what the signal source and specification should be.

The signals coming from outside the processor core can be asynchronous or synchronous, but need to stay high for at least an entire (synchronous) clock cycle.
The CLINT unit then scans for rising edges using a simple state machine, where each rising edge of the interrupt signal lines causes an interrupt request pulse that flows through the pipeline in the form of the accumulated interrupt vector.
The exact signal specifications are illustrated in Figure~\ref{fig:clint-waveform}.
Signals \texttt{exti} and \texttt{timi} represent interrupt lines coming from outside the processor core, while signals \texttt{ext\_irq} and \texttt{tim\_irq} represent core internal interrupt request pulses that flow through the pipeline.

This way, a wide range of signals can be supported as valid external interrupts and in particular, the signal produced by the FPGA's on-board button can be used without adding a dedicated unit.

\begin{figure}[h]
    \centering
    \includegraphics[width=0.8\textwidth]{clint-waveform.pdf}
    \caption{External and timer interrupt request waveforms.}
    \label{fig:clint-waveform}
\end{figure}

\subsubsection{External Interrupt Requests}
\label{subsubsection:ext-irq}

For simplicity, the external interrupt was implemented using one of the on-board buttons.
These buttons offer an already debounced signal, simplifying implementation.
This is the only external interrupt source present as of now.
Even with multiple sources, they could all map to the same interrupt for which there would be only one interrupt handler.
To support more external interrupts, separate sources could be bound to unused bits of the interrupt vector.
One could also expose different devices using memory-mapped registers, have them all cause the same interrupt (a collective external interrupt) and use these registers in software to differentiate between the different sources.

\subsubsection{Timer Interrupt Requests}
\label{subsubsection:tim-irq}

Timer interrupts on the other hand are not entirely external but need to be programmable.
The RISC-V privileged specification \cite{riscv-isa-vol-2} has a section on \texttt{mtime}, describing the addition of two 64-bit memory-mapped registers and their semantics.
More specifically, it outlines the addition of a 64-bit memory-mapped register called \texttt{mtime}, and another one called \texttt{mtimecmp}.

Even for 32-bit architectures, these registers need to be 64 bits wide to support a high enough clock granularity without worrying about overflows, for a clock running at a constant frequency.
The \texttt{mtime} register is set automatically by hardware and encodes a strictly increasing unsigned integer value representing the current tick, starting from some arbitrary point in time.

The timer interrupt is then programmed by setting the \texttt{mtimecmp} to some point in the future, by adding an appropriate value to the current tick.
For this to work, according to the specification, the platform must provide a mechanism for determining the period of a tick.
Once the \texttt{mtime} register value has reached or exceeded the \texttt{mtimecmp} register value, the interrupt goes high and remains high until that is no longer the case.

Special care needs to be taken in a software implementation using \texttt{mtimecmp}, in order not to cause invalid transient interrupts with incorrect write sequences, as only 32-bits can be written atomically.
One such correct write sequence suggested in \cite[Fig.~3.29]{riscv-isa-vol-2} is illustrated in Figure~\ref{fig:mtimecmp-write-sequence}.
For the implementation, the registers are exposed to software as four memory-mapped registers \texttt{mtimelo}, \texttt{mtimehi}, \texttt{mtimecmplo} and \texttt{mtimecmphi}, at addresses \texttt{0xFFFF8000}, \texttt{0xFFFF8004}, \texttt{0xFFFF8008}, \texttt{0xFFFF800C} respectively.

\begin{figure}[h]
    \centering
    \includegraphics[width=\textwidth]{mtimecmp-write-sequence.png}
    \caption{Correct write sequence for \texttt{mtimecmp} as can be found in \cite[Fig.~3.29]{riscv-isa-vol-2}.}
    \label{fig:mtimecmp-write-sequence}
\end{figure}

These memory-mapped registers are not part of the processor core itself, but are instead contained in and exposed by a device in a separate address space.

\section{M Extension Implementation}
\label{section:m-ext-implementation}

The M Extension Implementation is described in Section~\ref{section:m-arch}.
It is implemented as a multicycle component that can cause pipeline stalls until its operation is completed.

Besides the small additions to the decode stage to support the new instructions, a unit called the muldiv unit has been added to the execute stage of the pipeline.
This unit functions analogously to the ALU, in that it takes an operation and two operands to output a single result, although without a zero flag or any other flags.

Two implementations for the muldiv unit are provided, which can be toggled using generics.
The first implementation is a naive one that is simple but assumes a fixed delay of 32 additional clock cycles to complete any operation.
The second is an improvement over the first one in that it stalls for only two clock cycles when multiplying, although it still stalls for up to 35 clock cycles when dividing (and only a single cycle for trivial cases such as division by one).
We will consider both separately.

\subsection{Naive Implementation}

The naive implementation is a simple implementation that mostly lets the FPGA development tools infer and generate multiplication and division functions.
As such, binary multiplication and division themselves don't have to be implemented, although their use still has to be integrated into the pipeline and reconciled with the semantics of the M extension specification.

However, in the context of FPGA development (using Intel Quartus with the Cyclone IV E family of FPGAs), to meet timing requirements the development tool has to be informed that these circuits are expected to take multiple cycles in order to keep the maximum frequency as high as possible.
This is done using multicycle path constraints.

Particularly for multiplication, on-chip DSP functions with low latencies can be used to avoid having to use multiple cycles for multiplication.
With division however, such a delay is unavoidable.
But because both multiplication and division set the result bit vector on which the timing constraint is placed, they both stall the pipeline for 32 clock cycles.

\subsection{Custom Implementation}

In general, the guiding principle for this custom implementation is to make the common case fast.
As application writers are aware that division is more costly than multiplication, division is already avoided wherever possible.
Our analysis in Chapter~\ref{chapter:overview} also shows that the frequency of M extension instructions themselves is not too high, compared to more prominent instructions.

The custom implementation still uses on-chip DSP blocks with low latencies for multiplication.
This enables binary multiplication to be done in a single clock cycle, although the state machine managing the logistics of the muldiv unit makes it necessary to stall for two clock cycles.
To avoid having multiplication be involved in the critical path, when multiplication itself is not so common (as we have seen), this is considered acceptable.

Division however is not implemented as an inferred operation, but using a custom intdiv unit that uses a simple shift-based binary integer division algorithm.
A state machine listens for incoming requests, idling until a division operation (\texttt{DIV[U]} and \texttt{REM[U]}) appears.
If the operation is a trivial case, the result of which can be determined in a single clock cycle, then the result is immediately produced and the unit stalls only for a single clock cycle.
These trivial cases are division by zero, one and negative one, for which the RISC-V unprivileged specification \cite{riscv-isa-vol-1} defines appropriate return values which can be seen in Figure~\ref{fig:m-div-special}.

\begin{figure}[h]
    \centering
    \includegraphics[width=\textwidth]{m-div-special.png}
    \caption{Semantics for division by zero and division overflow with $L=32$ for MiRiV, as illustrated in \cite[Tab.~7.1]{riscv-isa-vol-1}.}
    \label{fig:m-div-special}
\end{figure}

Any other set of division operands begins a process in which the result is calculated in a series of subtraction and shift operations.
At the end of these 32 cycles worth of subtraction and shift operations, both the quotient and remainder have been determined.
Finally an intermediate state determines the correct sign and then the result is returned to the muldiv unit.

Besides making the common case fast, this implementation can also be inspected and understood in comparison to the generated and opaque result of the FPGA tooling.
%It would also be simpler to use this custom implementation when not targeting an FPGA-based platform.
For these reasons this custom implementation is preferred over the naive one, although the naive one is left available as a choice for completeness.

\section{Modified Branch Predictor Implementation}
\label{section:bp-implementation}

The modified branch predictor is implemented as a simple 2-bit predictor exactly as specified in Section~\ref{section:bp-arch}.

Although its performance has not been extensively benchmarked, it has been compared to the pre-existing branch predictor using assembly test programs that were already part of MiRiV, where it has shown a definite improvement.

The modified branch predictor is also vital for this implementation for another reason.
Because the trap unit is disabled after a pipeline flush, pipeline flushes could lead to starvation of interrupts when the program enters an infinite loop that continually flushes the pipeline.
But because the branch predictor learns the correct course of action quickly in such a case, no mispredictions can cause pipeline flushes and the pipeline is guaranteed to reach a state where it can trap.

If this type of behavior is undesired, as can be expected for example for any program that enters an infinite loop after some setup to service interrupt request while in this idle state, the modified branch predictor should be used.
A better solution would be not to rely on the branch predictor, but for simplicity this solution was adopted.

\section{Preparing FreeRTOS}
\label{section:preparing-freertos}

FreeRTOS has been chosen as the operating system of choice to implement a demo with for our extended MiRiV processor.
But how does one use FreeRTOS?
In the remainder of this chapter we will discuss FreeRTOS, its integration into the suite of software test programs for the MiRiV processor and some interesting tensions between the hardware and software sides.

\subsection{FreeRTOS Overview}

As an RTOS, FreeRTOS is a kernel that will offer features only if required, in the spirit of providing only what is strictly necessary for an application and no more.
This is of course aligned with the usual goal of microcontroller development, where costs have to be driven low.
FreeRTOS resembles a highly configurable library where it is possible to optimize for code size reductions and speed, although only where it is possible without compromising important safety guarantees.

Because of this structure, FreeRTOS as an operating system is used by integrating it with the actual target application.
This makes FreeRTOS an explicit dependency where library features are used directly, as compared to exposing a POSIX-like API in the form of a standard library or similar approaches reminiscent of other operating systems.

\subsection{FreeRTOS Software Integration}

The base MiRiV processor already provided a suite of software tests, using simple and self-contained RISC-V Assembly and C programs.
They are converted into a set of target files that can then be loaded into the processor's memory using Makefiles.
This is achieved by taking a set of source files (\texttt{*.c, *.h, *.S}) and converting them to a set of target files (\texttt{*.imem.mif, *.dmem.mif}).
The detailed compilation flow (using the RISC-V GNU toolchain \cite{web-riscv-gnu-toolchain}) is illustrated in Figure~\ref{fig:compilation-flow}.

\begin{figure}[h]
    \centering
    \includegraphics[width=\textwidth]{compilation-flow.pdf}
    \caption{Compilation flow, from source files (\texttt{*.c, *.h, *.S}) to target files (\texttt{*.imem.mif, *.dmem.mif}).}
    \label{fig:compilation-flow}
\end{figure}

The target files are used to directly program the FPGA's on-chip memory, writing to separate instruction and data memory regions to initialize the MiRiV processor allowing it to run from a clean state.
Currently, both the base and extended MiRiV processor implementations use an address width of 14 (byte-addresses) inside the processor core, leading to a limitation of 16 KiB for both the data and instruction memories.
The (byte) address space is \texttt{0x0000} to \texttt{0x3FFF} for the data memory, and \texttt{0x4000} to \texttt{7FFF} for the instruction memory.
Because the target files consist of one hexadecimal word per line, one can easily check if either the instruction or data memory is too small by verifying whether their respective target files exceed 4096 lines.
Code size can be and is currently kept low by compiling the application in a way that optimizes for code size (GCC's \texttt{-Os} flag).

Of course a kernel with all the safety guarantees, functionality and performance that FreeRTOS offers is not so simple to implement and if many components are used, the resulting program quickly grows too large to respect these size limitations.
To give a rough overview of some of its features, FreeRTOS offers support for tasks and co-routines, queues, binary semaphores, counting semaphores, mutexes, recursive mutexes, task notifications (inter-process communication), stream and message buffers and several software timers.
When incorporating FreeRTOS into a project, one can choose which of these features to make use of and which to relinquish, as is often done to fit FreeRTOS code usage to the requirements of an application and no more.
In order for FreeRTOS to be useful, at the very least tasks have to be used, while most other features can be seen as optional dependencies.

The pre-existing build infrastructure was extended to link against a recent and stable version of the FreeRTOS kernel (version 10.4.6).
The only dependency pulled in, out of the above listed features, is support for tasks (\texttt{tasks.h} + \texttt{tasks.c}).
This inherently requires use of FreeRTOS's own list implementation (\texttt{list.c} + \texttt{list.h}).
FreeRTOS maintains several ports for different processor architectures, one of which is a port for RISC-V.
Thus, port-related code is also necessary (\texttt{port.c} + \texttt{port.h}, \texttt{portASM.S}).

Initially it was considered to also make use of software timers to avoid having to implement hardware timers, but code size constraints have made it easier to opt for a hardware solution.
More about this is mentioned in the next section.

Besides FreeRTOS related code and the application itself, the pre-existing basic libc implementation of the MiRiV suite of software programs (\texttt{util.c} + \texttt{util.h}) was re-used and extended appropriately as required.

\subsection{Connecting MiRiV and FreeRTOS}

Usually, the task of porting FreeRTOS to a new board is not so trivial.
To avoid having to implement any kernel code, which would be a long-winded process if it had to be tested, code re-use for existing RISC-V FreeRTOS ports has been a top priority.
As previously mentioned, the RISC-V FreeRTOS port can automatically install a trap handler and all other necessary utilities.
These work in a tried and tested way, if one adheres to the SiFive CLINT design.
This was one of the motivations behind the implementation of the SiFive CLINT design and it has proven to be a good choice.
While testing the main application and fixing bugs has been a challenging process, no RISC-V port specific code had to be extended or changed.

Because software timers could not be used, it was necessary however to offer hardware timers and configure them appropriately in the FreeRTOS configuration.
When using hardware timers, the RISC-V FreeRTOS port assumes memory-mapped registers \texttt{mtime} and \texttt{mtimecmp} to be present.
The base MiRiV platform already offered a hardware timer with the semantics required, but this hardware timer did not conform to the RISC-V privileged specification \cite{riscv-isa-vol-2}.
The changes made to support hardware timers are mentioned in Section~\ref{subsubsection:tim-irq}.
This hardware timer runs at a constant frequency of 1 MHz, the rate at which the value of \texttt{mtime} is incremented per second.
One should note that usually \texttt{mtime} would be writable, but to preserve backward compatibility with the old hardware timer and applications that have made use of it, \texttt{mtime} is kept read-only and only \texttt{mtimecmp} can be read and written.

\subsection{Configuring FreeRTOS}

FreeRTOS supports many different configurations, as it finds use with many real-world applications.
For this reason, there is a long list of configuration options that can be set to parameterize the specific board FreeRTOS runs on and how its features should be used.

\subsubsection{Timers}

In this configuration, addresses of memory-mapped registers such as \texttt{mtime} have to be specified.
One also has to configure the timebase of the hardware timers and the rate of ticks per second.
FreeRTOS supports features like time slicing, which is realized by making scheduling decisions during the so-called tick interrupt.
The tick interrupt is a timer interrupt that is raised at a constant rate and it counts how many ticks have passed, which FreeRTOS can use to make decisions about whether a blocked task has waited long enough.
There are several ways to block a task, the most common of which is to delay a task, either for a fixed time or until a certain point in the future.
The fixed task delay in ticks for example can be calculated by taking the total time in milliseconds and dividing it by the tick period in milliseconds.

\subsubsection{Debugging Options}

Another important configuration option, for debugging purposes, is to configure assertions.
FreeRTOS has several sanity checks in different places, which can call a user-provided function when one of these assertions fails.
It is possible for example, to use a function that prints to the UART serial console at runtime for which file and line number the assertion has failed, by using GCC related primitives.
There are also other facilities for debugging within FreeRTOS, like hooks (callbacks) as well as additional logging and trace features.

\subsubsection{Task-Specific Stack Memory}

What also has to be configured is the amount of memory that tasks have allocated for their task-specific stacks.
There is a main stack used by the whole application overall, but to ensure correct execution, FreeRTOS takes care to keep this data separate from what happens inside a task.
This memory has to be allocated somewhere and to realize this, FreeRTOS offers several options.
The most important distinction is dynamic and static allocation.

Tasks are not necessarily meant to but can be dynamic, as tasks can be paused, resumed, stopped, killed and re-created.
If dynamic task features are used, FreeRTOS needs to pull in another dependency for a heap implementation and the heap has to be configured appropriately.
To avoid this additional dependency to respect aforementioned code size constraints, dynamic tasks are not used.

Because the intended demo application only uses two tasks that continually block to wait and are then resumed again, both of which are created during initialization and run endlessly, we can get away with using only static tasks.
Static tasks don't use a heap implementation but instead require the application to provide static memory during initialization.
When using static tasks, one also has to take care of the initialization of the idle task.
Whenever no task is running, FreeRTOS executes the idle task, which requires its own stack.
In the RISC-V FreeRTOS port there is also a specific stack for the trap handler.

Lastly, there are stack protection features available in FreeRTOS.
Using no stack protection is the most performant option.
The most aggressive stack protection initializes task-specific stack memory with a magic value and watermarks the region of memory at the end, so as to be able to check if anything other than the task itself has modified this memory.
A compromise between aggressive stack protection and no stack protection is using just the watermark, which can also be configured.
FreeRTOS does not guarantee to detect stack corruption even with the aggressive stack protection feature enabled.

\subsubsection{Additional Hooks}

There is also the option of providing additional hooks to be called by FreeRTOS.
One such hook is the stack overflow hook, which is called when FreeRTOS manages to detect a stack overflow.
The FreeRTOS developer documentation suggests that fatal exceptions might be raised on some processors in response to stack corruption before the kernel overflow check can occur, which has been observed during testing with our implementation. \cite{web-freertos-stack-usage}
Then there are the tick hook, which is called at the end of every tick (timer interrupt) and the idle hook called towards the end of the idle task.
Currently the demo application does not support idle hooks, as there is an unresolved task-switching bug when using idle hooks.

\begin{figure}[htbp]
    \centering
    \includegraphics[width=\textwidth]{de2-115-execution.jpg}
    \caption{Terasic Altera DE2-115 board running the FreeRTOS demo application, with board and terminal output visible.}
    \label{fig:de2-115-execution}
\end{figure}

\newpage

\section{FreeRTOS Demo Application}
\label{section:freertos-demo}

In Section~\ref{section:preparing-freertos} we have set up the FreeRTOS application that was used for analysis of required extensions in Chapter~\ref{chapter:overview}.
Now we want to briefly look at the results and see what is happening on the processor during execution on the actual boards and in a simulation.

An example execution is illustrated in Listing~\ref{listing:demo}, where we can see terminal output transmitted by the extended MiRiV implementation over UART.
Figure~\ref{fig:de2-115-execution} illustrates this terminal output in a picture with the Terasic Altera DE2-115 board and Figure~\ref{fig:de0-nano-execution} with the Terasic Altera DE0-Nano board.
In this execution, FreeRTOS first sets up two static tasks \texttt{task1} and \texttt{task2}.
Because the tasks are created in a pending state, the scheduler immediately schedules the execution of both tasks and then transitions into an idle state to wait until further task executions have to be serviced.
These two tasks alternate, with \texttt{task1} set up to execute once every second and \texttt{task2} once every two seconds.

\begin{figure}[htbp]
    \centering
    \includegraphics[width=\textwidth]{de0-nano-execution.jpg}
    \caption{Terasic Altera DE0-Nano running the FreeRTOS demo application, with board and terminal output visible.}
    \label{fig:de0-nano-execution}
\end{figure}

\begin{listing}[htbp]
\begin{minted}{text}
Welcome to minicom 2.6.2

OPTIONS: I18n 
Compiled on Jun 10 2014, 03:20:53.
Port /dev/ttyUSB0, 08:41:49

Press CTRL-A Z for help on special keys

creating task 1
task 1 created successfully
creating task 2
task 2 created successfully
starting scheduler
task1 counter: 0
task2 counter: 0
task1 counter: 1
task1 counter: 2
task2 counter: 1
task1 counter: 3
\end{minted}
\caption{Serial console UART output of the extended MiRiV implementation executing the FreeRTOS task switching demo application, with \texttt{task1} set up to execute twice as often as \texttt{task2}.}
\label{listing:demo}
\end{listing}
%\begin{lstlisting}[caption={\label{listing:demo} UART output of the extended MiRiV implementation executing the FreeRTOS task switching demo application, with \texttt{task1} set up to execute twice as often as \texttt{task2}.}]

When \texttt{task2} becomes pending, \texttt{task1} becomes pending too.
Because the tasks have fixed priorities assigned, the task execution order remains deterministic.

Listings~\ref{listing:code-main}~and~\ref{listing:code-tasks} show the C source code of the \texttt{main} (executed upon boot), \texttt{task1} and \texttt{task2} functions.
The function \texttt{main} attempts to initialize both tasks, writing to UART via function \texttt{putstring} (wrapped in a \texttt{TRACE} macro that enables/disables output for easier visibility in simulations) and a notification that the scheduler is about to start.
Do note that there would be additional output after the scheduler returns, but it is not supposed to ever return.

\begin{listing}[htbp]
\begin{minted}{c}
int main(void) {
    TaskHandle_t task1_handle, task2_handle;

    TRACE(putstring("creating task 1\r\n"));
    task1_handle = xTaskCreateStatic(
        task1, // function name
        "task 1", // task name
        TASK_STACK_SIZE, // stack size
        NULL, // task parameters
        3, // priority (high number = high priority)
        task1_stack, // stack array
        &task1_buffer // task buffer
    );
    if (task1_handle) {
        TRACE(putstring("task 1 created successfully\r\n"));
    } else {
        TRACE(putstring("error creating task 1\r\n"));
        return 1;
    }

    TRACE(putstring("creating task 2\r\n"));
    task2_handle = xTaskCreateStatic(
        task2, // function name
        "task 2", // task name
        TASK_STACK_SIZE, // stack size
        NULL, // task parameters
        2, // priority (high number = high priority)
        task2_stack, // stack array
        &task2_buffer // task buffer
    );
    if (task2_handle) {
        TRACE(putstring("task 2 created successfully\r\n"));
    } else {
        TRACE(putstring("error creating task 2\r\n"));
        return 1;
    }

    TRACE(putstring("starting scheduler\r\n"));
    vTaskStartScheduler();

    // this segment shouldn't be reached!
    TRACE(putstring("this line should never be printed\r\n"));
    return 0;
}
\end{minted}
\caption{C definition of \texttt{main}}
\label{listing:code-main}
\end{listing}
%\begin{lstlisting}[language=C, caption={\label{listing:code-main} C definition of \texttt{main}}]

In the task definitions, we set up a counter and then enter an infinite loop where the counter value is printed via UART before incrementing it and blocking the task for the duration of the delay.
Other ways of blocking periodic tasks as these ones exist, but \texttt{vTaskDelay} was chosen for simplicity.

\begin{listing}[htbp]
\begin{minted}{c}
void task1(void * parameters) {
    // x ticks (delay) = y ms / portTICK_PERIOD_MS ms
    const TickType_t delay = 1000 / portTICK_PERIOD_MS;
    unsigned int counter = 0;

    for (;;) {
        TRACE(putstring("task1 counter: "));
        TRACE(putnumber(counter));
        TRACE(putstring("\r\n"));
        counter++;
        vTaskDelay(delay);
    }
}

void task2(void * parameters) {
    // x ticks (delay) = y ms / portTICK_PERIOD_MS ms
    const TickType_t delay = 2000 / portTICK_PERIOD_MS;
    unsigned int counter = 0;

    for (;;) {
        TRACE(putstring("task2 counter: "));
        TRACE(putnumber(counter));
        TRACE(putstring("\r\n"));
        counter++;
        vTaskDelay(delay);
    }
}
\end{minted}
\caption{C definitions of \texttt{task1} and \texttt{task2}}
\label{listing:code-tasks}
\end{listing}
%\begin{lstlisting}[language=C, caption={\label{listing:code-tasks} C definitions of \texttt{task1} and \texttt{task2}}]

\FloatBarrier

To understand the way FreeRTOS operates we will look at a few ModelSim simulation traces.

The way FreeRTOS initializes once at the start, goes idle once everything is set up and wakes up again to execute pending tasks can be seen in Figure~\ref{fig:freertos-demo-full}.
The internal \texttt{mtime} counter frequency and task delay have been changed to make results more visible in the ModelSim simulation traces.
In the trace we can see a lengthier initialization section, the execution of \texttt{task1}, then \texttt{task1} and \texttt{task2} together and finally \texttt{task1} again, with periods of inactivity in between.

\begin{figure}[h]
    \centering
    \includegraphics[width=\textwidth]{wave-freertos-demo-full.png}
    \caption{ModelSim simulation trace of the extended MiRiV implementation executing the FreeRTOS demo application, with initialization and two tick interrupts visible. Section 1 denotes the initialization section, sections 2 and 4 denote the execution of \texttt{task1} alone and section 3 denotes the execution of both \texttt{task1}.}
    \label{fig:freertos-demo-full}
\end{figure}

FreeRTOS is only woken up when a timer interrupt or an external interrupt occur.
To react correctly to these events, FreeRTOS configures the trap handler and regular timer interrupts in the initialization section of the execution of FreeRTOS.
These regular timer interrupts cause the execution of the automatically installed trap handler, which determines the interrupt cause and proceeds executing the scheduler.
With this mechanism FreeRTOS also enables pre-emption as well as other possibly desired features and the precise semantics for such interactions can be configured.

Figure~\ref{fig:freertos-demo-init} highlights the initialization section, where the two tasks are created.
The stack protection feature causes two calls to \texttt{memset} to be made to initialize the task-specific stacks with internally pre-defined values.
Not all traps are caused by timer interrupts, the other type of trap that can be seen in these traces are \texttt{ECALL} (environment call from M-mode) exceptions, which are used in the RISC-V FreeRTOS port for control flow purposes.

\begin{figure}[h]
    \centering
    \includegraphics[width=\textwidth]{wave-freertos-demo-init.png}
    \caption{ModelSim simulation trace of the extended MiRiV implementation executing the FreeRTOS demo application, with only initialization visible. Sections 1 to 4 denote the calls to \texttt{memset} to initialize the task-specific stacks of \texttt{task1}, \texttt{task2}, idle task and interrupt handler. Section 5 denotes the tail of initialization during which the scheduler is started, the first pending tasks executed and the timer configured for the next execution of a task.}
    \label{fig:freertos-demo-init}
\end{figure}

Figure~\ref{fig:freertos-demo-single-task} highlights the first tick interrupt with the first execution of \texttt{task1} as it becomes pending.
A timer interrupt is caused just as \texttt{task1} becomes pending and the scheduler switches to the execution of \texttt{task1}.
In the trace we can see by looking at \texttt{trapped\_reg} that the timer interrupt completes after a while, the processor returns using the \texttt{MRET} instruction, executes the task and invokes \texttt{ECALL} before going to sleep again.

The \texttt{ECALL} instruction is invoked when the task calls the delay function, at which point the kernel tries to determine whether there is another task pending or whether it should go into an idle state.

\begin{figure}[h]
    \centering
    \includegraphics[width=\textwidth]{wave-freertos-demo-single-task.png}
    \caption{ModelSim simulation trace of the extended MiRiV implementation executing a single pending task in a tick interrupt. Section 1 denotes the timer interrupt waking up the scheduler, section 2 denotes the execution of \texttt{task1} and section 3 denotes a transfer of control back to the scheduler.}
    \label{fig:freertos-demo-single-task}
\end{figure}

Figure~\ref{fig:freertos-demo-double-task} shows the second tick interrupt where both \texttt{task1} and \texttt{task2} become pending.
Similarly to before, the timer interrupt initiates the scheduler, \texttt{task1} is executed and returns control to the kernel.
But at this point the kernel also executes \texttt{task2} before going idle because \texttt{task2} is still pending.

\begin{figure}[h]
    \centering
    \includegraphics[width=\textwidth]{wave-freertos-demo-double-task.png}
    \caption{ModelSim simulation trace of the extended MiRiV implementation executing two pending tasks back to back in a tick interrupt. Section 1 denotes the timer interrupt waking up the scheduler, sections 2 and 4 denote execution of \texttt{task1} and \texttt{task2} respectively and sections 3 and 4 denote transfers of control back to the scheduler.}
    \label{fig:freertos-demo-double-task}
\end{figure}

The demo was chosen in a way that highlights the correct operation of basic FreeRTOS features.
Going further, tasks could be extended to implement more complex application, perhaps to find out how much can be accomplished within the limits of schedulability and code size on the extended MiRiV processor.
But exploring the space of possibilities is left as a task for future implementation, perhaps as part of continuative work.

\chapter{Outlook and Conclusion}
\label{chapter:conclusion}

\section{Conclusion}
\label{section:conclusion}

Looking back, we have surveyed the landscape of available operating systems, giving an idea of the rough state of the RISC-V ecosystem and its support.
We have also taken a look at RISC-V extensions and other important parts of the specification, identifying exactly which parts are necessary for the execution of an operating system.
Finally we have detailed an implementation, following a strategy of making appropriate changes from both the hardware and software side of the existing MiRiV processor.

It is not a surprising result that supporting the execution of an operating system is no trivial matter.
The operating system implemented in the form of a simple demo application is FreeRTOS, showing the new MiRiV processor capabilities for context switching by running two tasks that print their own counter values to the UART serial console at precise and fixed intervals in time.
For a RISC-V processor running the 32-bit base integer instruction set (RV32I), such as the base MiRiV implementation, real-time operating systems (RTOSs) represent a reasonable goal when trying to identify, implement and verify all extensions necessary for their execution.
They also have more straightforward pathways towards an implementation, compared to embedded Linux.

As for the hardware extensions, an implementation for an exception, interrupt and trap architecture, the Zicsr extension, M extension and a new branch predictor (simple 2-bit branch predictor) were detailed.
By way of RISC-V's identification scheme, this makes the new MiRiV processor an RV32IMZicsr processor.
While many RISC-V extensions retain a neat modularity, it is interesting to note that the interrupt, exception and trap architecture cannot be decoupled from the Zicsr extension and both the RISC-V unprivileged \cite{riscv-isa-vol-1} and privileged \cite{riscv-isa-vol-2} specifications have to be consulted at various points to create a processor that is then actually capable enough for more complex behavior.
There are similar such interdependencies for other extensions.

With appropriate refactoring, the original goal of providing a set of extensions that students can plug in to use their own processor core and run FreeRTOS on it is achieved.
Students who want to tinker with FreeRTOS itself and implement more complex use-cases than our simple demo application can now do so.
However, they are limited in what they can do with the current implementation without further extending the processor.
With current instruction and data memory size limitations, FreeRTOS on the extended MiRiV processor is better suited at implementing static multi-tasking with few auxiliary library features used and it might be necessary to optimize for code size reductions and lower data memory usage.

\section{Future Work}
\label{section:future-work}

Before concluding our work, we want to look at several areas of the extended MiRiV processor and FreeRTOS demo application implementation that could be built on or improved in the future.

While the extended MiRiV processor can run FreeRTOS with the demo application, it is limited in the sense that the application for which FreeRTOS is used cannot be easily extended without violating instruction memory size constraints and depending on the application, perhaps even data memory constraints.
This is an obvious area in need of improvement.
Changing the size of the instruction and data memories is not a particularly complex change, but definitions, constants and addresses have to be updated at several points in the implementation of the hardware and software.
The address spaces of the on-chip RAMs for the instruction and data memory have to be updated, but depending on the change also those of the memory-mapped devices (UART and the hardware timer) and the JTAG bridge with which the FPGA on-chip RAMs are programmed from the host computer.
An effort was made to increase the size of the instruction and data memories, but because the FPGA's on-chip RAM resources are limited, this effort was not completed.

An idea for a more powerful test application that is feasible without adding any additional devices to the processor would be a simple shell, implementing the \texttt{fork/exec -> exit} flow.
This would require mature support for FreeRTOS tasks with dynamic memory allocation, which is also an area that could be improved upon.

Currently the extended MiRiV processor is equipped with the following devices: a pre-programmed button on the FPGA for external interrupts, a UART serial connection and a single hardware timer.
If they are not enough, additional devices could be implemented.
One could for instance support any number of buttons, switches, seven-segment displays or other hardware that is already made available by the FPGA board.
FreeRTOS can also run as a networked application, so one could imagine support for an Ethernet connection.
Another relatively self-contained addition to this list of devices would be support for graphics.
Support for graphics is quite well aligned with the capabilities of FreeRTOS, because the concurrency features of FreeRTOS can be used to realize simple games and other such use cases.

Other FPGA resources, such as the SDRAM, could be used for a future application.
This would greatly increase the size of available memory, to the point where it would even be feasible to run embedded Linux in the configurations we have discussed on the MiRiV processor.

One could also realize the Zephyr RTOS demo on the extended MiRiV processor.
A demo was included for our comparison between operating systems, but only FreeRTOS has been configured and tested to the point where the demo application runs on the extended MiRiV processor.

The most ambitious goal would be to extend the processor further to run embedded Linux (or as an even more ambitious follow-up, full-fledged Linux distributions).
Besides the already mentioned code size requirements, making it unavoidable to change the memory architecture, one would also need to implement support for several other unprivileged extensions or their emulation using exceptions, S-mode and an MMU.

The processor itself could also be improved in generic ways.
The branch predictor is an obvious area to target for further increases in performance.
%The caches could be tweaked.
%Support for more CSRs could be added.
The interrupt architecture could be extended to be more powerful and programmable (for instance, a SiFive CLIC could be added to replace the CLINT).
The processor could also be adapted to support multiple cores, although it would be a considerable effort.

\backmatter

% Use an optional list of figures.
\listoffigures % Starred version, i.e., \listoffigures*, removes the toc entry.

% Use an optional list of tables.
\cleardoublepage % Start list of tables on the next empty right hand page.
\listoftables % Starred version, i.e., \listoftables*, removes the toc entry.

% Use an optional list of alogrithms.
%\listofalgorithms
%\addcontentsline{toc}{chapter}{List of Algorithms}

% Add an index.
\printindex

% Add a glossary.
\printglossaries

% Add a bibliography.
\bibliographystyle{alpha}
\bibliography{thesis}

\end{document}
